Method of constructing preferred views of hierarchical data

ABSTRACT

Disclosed is a method of constructing at least one data structure from at least one data source. A representation is constructed of the data source and at least one previous view of the data source. From the representation, at least one compulsory entity (e.g. a “branch”) is then identified. This may generally be performed by a user selection. The method then constructs the data structure comprising the compulsory entity and one or more context entities, where the context entities are obtained from the representation and context data obtained from the previous view. Typically the data source is hierarchical and the data structure is hierarchical.

COPYRIGHT NOTICE

This patent specification contains material that is subject to copyright protection. The copyright owner has no objection to the reproduction of this patent specification or related materials from associated patent office files for the purposes of review, but otherwise reserves all copyright whatsoever.

TECHNICAL FIELD OF THE INVENTION

The present invention relates to the general field of information retrieval and, in particular, to the automatic identification and retrieval of relevant data from large hierarchical data sources.

BACKGROUND

Extensible Markup Language (XML) is increasingly becoming a popular hierarchical format for storing and exchanging information. Whilst the hierarchical nature of XML makes it an excellent means for capturing relationships between data objects, it also makes keyword searching more difficult.

Keyword searching is of particular importance when dealing with a structured data format such as XML because it allows the user to locate particular keywords quickly without the need to know the internal structure of the data. It is a challenge when working with XML because there is no optimal or clearly preferred method for presenting the result of a keyword search. In the traditional unstructured text environment, the data system typically presents the user with the located keywords together with other text in their vicinity. If there is more than one ‘hit’, then the neighbouring text provides a useful context for distinguishing between hits, thereby allowing the user to quickly select the most relevant hit according to the user's needs.

In the structured XML environment, there is no clear concept of ‘neighbouring data’ since data that are related to one another may reside at several disjoint locations within an XML document. Thus it is difficult to identify or construct a suitable context for a hit in a keyword search. Consequently, most existing XML based systems simply return an entire XML document (out of a collection of XML documents) if a keyword hit occurs within the document, with the entire document effectively serving as the context for the hit. This is undesirable when documents are large and the user is not interested in seeing all of their contents.

Practical data sources, especially databases, often contain much more data than a user typically wants to see at any one time. For example, a database in a mail order store may contain details about all of its product lines, customers, suppliers, couriers, and lists of past and pending orders. A store clerk may at one time wish to see the current stock level for a particular product, and at another time may want to check the status of an order for a customer. A store manager on the other hand may wish to see the variation of the total sales for a particular product line over a number of months. In each of these cases it would be too distracting to the user if an avalanche of additional irrelevant data were to be also presented. Further, unless the user is familiar with the structure of the database, the user would typically be unable to identify information about which the user has an interest.

The traditional method for providing only relevant data is through the use of pre-created “views”, prepared by someone who is familiar with the structure of the data source, such as a system administrator. Each view draws together some subset of the data source and is tailored for a distinct purpose. In the previously given examples, the store clerk would consult a “stock level” view or an “order status” view, whilst the manager would bring up a “sales” view.

Whilst this approach of using pre-created views may be satisfactory when all likely usage scenarios can be anticipated, it is inadequate for keyword searching. In a keyword search operation, a user enters one or more keywords and the system responds with a data set or view that includes occurrences of all keywords (assuming an AND Boolean keyword search operation). In a hierarchical environment such as XML, keyword hits may occur in several data items residing at different locations in the hierarchy. Since it is not feasible to anticipate all possible keyword combinations that a user may provide, it is not possible to pre-determine where in the hierarchy hits will occur. Consequently it is not possible to provide pre-created views that will cater for all search scenarios.

An analogous keyword searching problem also exists in the relational database environment. A relational database comprises tables joined through their primary and foreign keys, where each table comprises a plurality of rows each denoting an n-tuple of attribute values for some entity. A traditional solution to keyword searching in a relational database, described by Hristidis, V. and Papakonstantinou, Y., “DISCOVER: Keyword Search in Relational Databases”, Proceedings of the 28th VLDB Conference, 2002, is to return a minimal joining network, which is the smallest network of joined rows across joined tables that contain all keyword hits. A problem with this approach is that it effectively treats rows as the smallest data “chunks” in that if a keyword hit occurs anywhere in a row of a database table then the entire row is returned as context for the hit. This may lead to excessive amounts of data being presented to the user since a typical relational database table often contains many columns that are not usually of interest to the user.

Further, adapting the above technique to hierarchical data structures such as XML may result in insufficient context information. In a hierarchical environment, related data may be stored at different levels in the hierarchy, and thus often data stored in a parent or ancestor node or their children may provide very useful context for a keyword hit, even though these may not be included in the minimal data set.

Some attempts have been made to address the keyword searching problem in hierarchical data. Florescu, D. et al, “Integrating Keyword Search into XML Query Processing”, Ninth International World Wide Web Conference, May 2000, discloses a method of augmenting a structural query language with a keyword searching operator, “contains”. This operator evaluates to TRUE if a specified sub-tree contains some specified keywords. The user can use this operator when constructing queries to filter out unwanted data. Whilst this useful feature does not require the user to specify the exact location of hit keywords within a given sub-tree, it does not go far enough since the user is still required to specify the exact format of the returned data in the search query and hence the user would still need to be familiar with the structure of the data source. In other words, free-text keyword searching is still not possible, unless the user is willing to accept an entire data source as a result of the search.

Another existing approach to keyword searching in an XML data source requires the user to select from a given list of schema elements, the element representing the root node of the returned data. If a keyword hit occurs in a descendant node of a data element represented by the selected schema element, then the entire sub-tree below the data element, containing the hit keyword, is returned to the user. This approach is cumbersome because it requires user interventions. Furthermore, the user is forced to accept an entire sub-tree even though it may contain data not of interest to the user.

Accordingly, there is a need for a method for determining a set of relevant data in a hierarchical data environment in response to a keyword search operation involving arbitrary combinations of keywords that does not require user interventions or prior user knowledge of the structure of the hierarchical data.

SUMMARY

It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing methods.

In accordance with one aspect of the present invention, there is disclosed a method of presenting data from at least one data source, said method comprising the steps of:

(i) holding a representation of said at least one data source and at least one previous view of said at least one data source;

(ii) identifying at least one compulsory entity in said representation; and

(iii) presenting a data structure comprising said at least one compulsory entity and one or more context entities, where said context entities are obtained from said representation and context data obtained from said at least one previous view.

More specifically disclosed is a method of constructing at least one data structure from at least one data source, said method comprising the steps of:

(i) constructing a representation of said at least one data source and at least one previous view of said at least one data source;

(ii) identifying at least one compulsory entity in said representation; and

(iii) constructing said at least one data structure comprising said at least one compulsory entity and one or more context entities, where said context entities are obtained from said representation and context data obtained from said at least one previous view.

Typically one or both of the data sources and the data structures are hierarchical.

In accordance with another aspect of the present invention, there is disclosed a method of selecting data from a data source, said method comprising the steps of:

(i) forming a graphical representation of said data source;

(ii) detecting a user selection of part of said representation;

(iii) selecting a set of components in said user-selected part based on an occurrence probability of said set of components given a component of said user-selected part.

In accordance with another aspect of the present invention, there is disclosed a method of construction and presentation of data for a keyword searching operation in at least one data source involving at least one search keyword, said method comprising the steps of:

(i) constructing a graphical representation of said at least one data source and at least one previous view of said at least one data source;

(ii) identifying at least one compulsory entity in said graphical representation, where said compulsory entity is a node in said graphical representation representing a location of one or more of said at least one search keyword;

(iii) constructing at least one data structure comprising said at least one compulsory entity and one or more context entities, where said context entities are obtained from said graphical representation and context data obtained from said at least one previous view; and

(iv) presenting said at least one data structure as a result of said keyword searching operation.

In accordance with another aspect of the present invention, there is disclosed a method of presentation of data sourced from a sub-tree of hierarchically-presented data, said method comprising the steps of:

(i) selecting a set of descendant nodes in said sub-tree based on context data obtained from at least one previous presentation of said hierarchically-presented data; and

(ii) constructing and presenting a hierarchical data structure comprising a root node of said sub-tree and said selected set of descendant nodes.

Other aspects of the present invention, including apparatus and computer media, are also disclosed.

BRIEF DESCRIPTION OF THE DRAWINGS

At least one embodiment of the present invention will now be described with reference to the drawings in which:

FIG. 1 is an example schema graph;

FIG. 2 is a flowchart of a keyword searching method;

FIGS. 3A and 3B show two example parent nodes in a schema graph;

FIG. 4 is a diagram of a network of server and client computers;

FIG. 5 is a flowchart of a method for identifying context nodes among a set of child nodes of a parent node not lying along a directed path from the root node to a hit node;

FIG. 6 is a flowchart of another method for identifying context nodes among a set of child nodes of a parent node not lying along a directed path from the root node to a hit node;

FIG. 7 is a flowchart of a method for identifying context nodes among a set of child nodes of a parent node lying along a directed path from the root node to a hit node;

FIG. 8 is a flowchart of another method for identifying context nodes among a set of child nodes of a parent node lying along a directed path from the root node to a hit node;

FIG. 9 is an example schema graph with two identical sub-trees;

FIG. 10 is a flowchart of the first, bottom-up traversal phase of the context node identification method with probability averaging;

FIG. 11 is an example of a schema graph with multiple hit nodes;

FIG. 12 is a flowchart of the first, bottom-up traversal phase of the context node identification method with probability averaging and involving multiple hit nodes;

FIG. 13 is a flowchart of a method for identifying context nodes among a set of child nodes of a parent node not lying along a directed path from the root node to a hit node, with probability averaging;

FIG. 14 is a flowchart of a method for identifying context nodes among a set of child nodes of a parent node lying along a directed path from the root node to a hit node, with probability averaging;

FIG. 15 is an example of a parent node whose descendant hit nodes are all located under a single child node;

FIG. 16 is a flowchart of a method for identifying context nodes among a set of child nodes of a parent node lying along a directed path from the root node to a hit node, with probability averaging, for the case where all multiple hit nodes are located under a single child node;

FIG. 17 is a flowchart of a method for identifying context nodes among a set of child nodes of a parent node lying along a directed path from the root node to a hit node, with probability averaging, for the case where hit nodes are located under multiple child nodes;

FIG. 18 is a flowchart of a method for identifying context nodes in which one or multiple hit nodes may be present;

FIG. 19 is a flowchart of a method for constructing context trees for cases involving a single hit node;

FIG. 20 is a flowchart of a method for constructing context trees for cases involving multiple hit nodes;

FIG. 21 is a flowchart of a method for constructing an alternative set of hit nodes that have higher observation frequencies than those in an original set of hit nodes;

FIG. 22 is a flowchart of a method for selecting an ancestor of a set of hit nodes that has a higher observation frequency than the set of hit nodes;

FIG. 23 is an example schema graph;

FIG. 24 is a schema graph of an example data view;

FIG. 25 is a schema graph of another example data view;

FIG. 26 is a schema graph of yet another example data view;

FIG. 27 is an occurrence frequency table arising from the data views in FIGS. 24, 25 and 26;

FIG. 28 is a co-occurrence frequency table arising from the data views in FIGS. 24, 25 and 26;

FIG. 29 is a leaf co-occurrence frequency table arising from the data views in FIGS. 24, 25 and 26;

FIG. 30 is a sole child co-occurrence frequency table arising from the data views in FIGS. 24, 25 and 26;

FIG. 31 is a portion of a joint-occurrence frequency table arising from the data views in FIGS. 24, 25 and 26;

FIG. 32 is another portion of a joint-occurrence frequency table arising from the data views in FIGS. 24, 25 and 26;

FIG. 33 is yet another portion of a joint-occurrence frequency table arising from the data views in FIGS. 24, 25 and 26;

FIG. 34 is yet another portion of a joint-occurrence frequency table arising from the data views in FIGS. 24, 25 and 26;

FIG. 35 is yet another portion of a joint-occurrence frequency table arising from the data views in FIGS. 24, 25 and 26;

FIG. 36 is the schema graph of a context tree returned as a result of a keyword search operation involving two keywords;

FIG. 37 is a schematic block diagram of a general purpose computer upon which the arrangements described may be practiced;

FIG. 38 is a flowchart of a sub-process within the method for constructing context trees for cases involving a single hit node depicted in FIG. 19; and

FIG. 39 is a flowchart of a sub-process within the method for constructing context trees for cases involving multiple hit nodes depicted in FIG. 20.

DETAILED DESCRIPTION INCLUDING BEST MODE

The present disclosure provides a method for determining a set of relevant data in a hierarchical data environment in response to a keyword search operation involving one or more keywords. A preferred implementation includes a Bayesian probabilistic based method that constructs preferred views of data in a hierarchical data structure based on how data is accessed in past episodes. More specifically, the method makes use of the frequencies of past joint-occurrences between pairs and vectors of data items to compute the probability that a data item is relevant to some other compulsory data items. Typically, the compulsory data items are those containing keyword hits, and thus must be returned to the user in the keyword search results. If a non-compulsory data item has a high probability of being relevant to a compulsory data item, then it is likely to be returned in the search results to serve as context for the keyword hits.
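The use of past joint-occurrence frequencies can be illustrated with a minimal sketch. It is not the Bayesian method of the preferred implementation; it only estimates, from raw pairwise counts, how often a candidate item appeared in past views that also contained a compulsory item. The function names, the item names and the three past views are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(views):
    """Count how often each item, and each unordered pair of items, appears in past views."""
    single = Counter()
    pair = Counter()
    for view in views:
        items = set(view)
        single.update(items)
        pair.update(frozenset(p) for p in combinations(sorted(items), 2))
    return single, pair

def relevance(candidate, compulsory, single, pair):
    """Estimate P(candidate in view | compulsory in view) from raw frequencies."""
    if single[compulsory] == 0:
        return 0.0  # the compulsory item was never seen; no evidence either way
    return pair[frozenset((candidate, compulsory))] / single[compulsory]

# Hypothetical past views, each given as the set of data items it contained.
past_views = [
    {"order", "customer", "status"},
    {"order", "status"},
    {"product", "stock_level"},
]
single, pair = cooccurrence_counts(past_views)

print(relevance("status", "order", single, pair))       # seen together in both "order" views
print(relevance("customer", "order", single, pair))     # seen together in one of two
print(relevance("stock_level", "order", single, pair))  # never seen together
```

An item whose estimate exceeds some chosen threshold would be a candidate context item for the compulsory item; the threshold itself is a design choice this sketch does not address.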

A distinguishing feature of the presently disclosed arrangements with respect to traditional pre-created view based approaches is that the former is able to synthesise new views, rather than merely returning an existing stored view. Such arrangements are thus able to handle keyword search operations involving arbitrary keyword combinations, and since views are dynamically generated, they can be better tailored to individual operations than those obtained from a fixed pool of pre-created views.

The presently disclosed methods typically construct a number of alternative views, and assign a score for each view, signifying how much the view may be of interest to the user. In one implementation, a single view that has the highest score among those constructed is returned to the user. In an alternative implementation, a list of views is returned, sorted according to their scores, from highest to lowest.

Although keyword searching is its primary motivation, the presently disclosed methods can also be used to enhance a method of presentation of hierarchical data, such as that described in Australian Patent Application No. 2003204824 filed 19 Jun. 2003 and corresponding U.S. patent application Ser. No. 10/465,222 filed 20 Jun. 2003, both entitled “Methods for Interactively Defining Transforms and for Generating Queries by Manipulating Existing Query Data”. In that publication, a method for selecting the most appropriate presentation type (such as tables, graphs, plots, trees, etc.) based on the structure and contents of a hierarchical data source is disclosed. That method can be enhanced by incorporating a preferred implementation of the present disclosure as a means for automatically selecting a most preferred subset of the data source for display, prior to the selection of presentation type. It is often useful to display only a preferred subset of data in this way since hierarchical data sources often contain more information than what would normally be of interest to the user, and hence a method for filtering out ‘uninteresting’ data such as the preferred embodiment of the present invention can help to make the user's experience more satisfying and productive.

Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and symbolic representations of operations on data within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that the above and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as “scanning”, “calculating”, “determining”, “replacing”, “generating”, “initializing”, “outputting”, or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical (electronic) quantities within the registers and memories of the computer system into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

The present specification also discloses apparatus for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a conventional general purpose computer will appear from the description below.

In addition, the present specification also discloses a computer readable medium comprising a computer program for performing the operations of the methods. The computer readable medium is taken herein to include any transmission medium for communicating the computer program between a source and a destination. The transmission medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The transmission medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein.

Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.

The methods of keyword searching in general, and hierarchical data structure construction in particular, are preferably practiced using a general-purpose computer system 3700, such as that shown in FIG. 37, wherein the processes of FIGS. 1 to 36 may be implemented as software, such as an application program executing within the computer system 3700. In particular, the steps of keyword searching are effected by instructions in the software that are carried out by the computer. The instructions may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part performs the searching methods and a second part manages a user interface between the first part and the user. The software may then be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer from the computer readable medium, and then executed by the computer. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer preferably effects an advantageous apparatus for keyword searching and hierarchical data structure construction.

The computer system 3700 is formed by a computer module 3701, input devices such as a keyboard 3702 and mouse 3703, output devices including a printer 3715, a display device 3714 and loudspeakers 3717. A Modulator-Demodulator (Modem) transceiver device 3716 is used by the computer module 3701 for communicating to and from a communications network 3720, for example connectable via a telephone line 3721 or other functional medium. The modem 3716 can be used to obtain access to the Internet, and other network systems, such as a Local Area Network (LAN) or a Wide Area Network (WAN), and may be incorporated into the computer module 3701 in some implementations.

The computer module 3701 typically includes at least one processor unit 3705, and a memory unit 3706, for example formed from semiconductor random access memory (RAM) and read only memory (ROM). The module 3701 also includes a number of input/output (I/O) interfaces including an audio-video interface 3707 that couples to the video display 3714 and loudspeakers 3717, an I/O interface 3713 for the keyboard 3702 and mouse 3703 and optionally a joystick (not illustrated), and an interface 3708 for the modem 3716 and printer 3715. In some implementations, the modem 3716 may be incorporated within the computer module 3701, for example within the interface 3708. A storage device 3709 is provided and typically includes a hard disk drive 3710 and a floppy disk drive 3711. A magnetic tape drive (not illustrated) may also be used. A CD-ROM drive 3712 is typically provided as a non-volatile source of data. The components 3705 to 3713 of the computer module 3701 typically communicate via an interconnected bus 3704 and in a manner which results in a conventional mode of operation of the computer system 3700 known to those in the relevant art. Examples of computers on which the described arrangements can be practised include IBM-PC's and compatibles, Sun Sparcstations or like computer systems evolved therefrom.

Typically, the application program is resident on the hard disk drive 3710 and read and controlled in its execution by the processor 3705. Intermediate storage of the program and any data fetched from the network 3720 may be accomplished using the semiconductor memory 3706, possibly in concert with the hard disk drive 3710. In some instances, the application program may be supplied to the user encoded on a CD-ROM or floppy disk and read via the corresponding drive 3712 or 3711, or alternatively may be read by the user from the network 3720 via the modem device 3716. Still further, the software can also be loaded into the computer system 3700 from other computer readable media. The term “computer readable medium” as used herein refers to any storage or transmission medium that participates in providing instructions and/or data to the computer system 3700 for execution and/or processing. Examples of storage media include floppy disks, magnetic tape, CD-ROM, a hard disk drive, a ROM or integrated circuit, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 3701. Examples of transmission media include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.

Keyword searching in a hierarchical environment comprises identifying the nodes or elements in the hierarchical data structure where the keyword or keywords occur and then determining what other data elements are relevant to the keywords. In a typical keyword searching scenario, the resulting data presented to the user is a second hierarchical data structure extracted from the first data structure and containing all or some of the search keywords and other data considered to be relevant to these keywords. Such a hierarchical data structure presented to the user as a result of the keyword search operation is referred to as a context tree.

When the hierarchical data being searched has a governing schema, as is often the case with XML, it is generally advantageous to employ a method for identifying relevant data that operates at the schema level. That is, elements within the schema representation are analysed to determine whether they are relevant to the search keywords. All instances of data items in the data source collectively represented by the relevant schema elements are then returned to the user as the result of the keyword search operation. In XML the governing schema can be in the form of an XML Schema, which itself is another hierarchical data structure. An XML Schema specifies the structure of the associated XML data, the list of elements and attributes in the XML data and their parent-child relationships. Since each element or attribute in an XML Schema typically represents many instances of elements and attributes in the XML data, an XML Schema is potentially a much smaller data structure and hence can be analysed more efficiently.
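The idea of working at the schema level can be sketched as follows. For brevity the sketch does not read an XML Schema; as a stand-in it treats each distinct element path in the instance document as one schema element, which is an assumption of this example only. The function name `schema_hits` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def schema_hits(xml_text, keyword):
    """Return the set of schema-level paths whose instances contain the keyword."""
    root = ET.fromstring(xml_text)
    hits = set()

    def walk(elem, path):
        here = path + "/" + elem.tag
        # Many instances share one path, so the set collapses them to one schema element.
        if elem.text and keyword.lower() in elem.text.lower():
            hits.add(here)
        for child in elem:
            walk(child, here)

    walk(root, "")
    return hits

# Hypothetical mail-order data source with repeated <product> instances.
doc = """<store>
  <product><name>Kettle</name><stock>4</stock></product>
  <product><name>Toaster</name><stock>9</stock></product>
  <order><item>Kettle</item><status>shipped</status></order>
</store>"""

print(sorted(schema_hits(doc, "kettle")))
```

Two product instances and one order instance are scanned, but only two schema-level elements are reported, which is why the subsequent relevance analysis can run over a much smaller structure than the data itself.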

It is often desirable to search for keywords in more than one hierarchical data source. Although each hierarchical data source on its own is tree-structured, when multiple data sources are considered together the resulting data structure may take on a more general form. One such form that invariably arises in a database environment is illustrated in FIG. 1. This structure essentially comprises a number of trees with shared nodes, where each tree represents the schema of a distinct hierarchical data source and the shared nodes are the result of data views whose contents span multiple data sources. Specifically the dotted boxes 1005 and 1010 in FIG. 1 denote the schemas of a first and second data source respectively, and node 1015 is the root node of a data view that brings together nodes 1020 and 1025 from the first data source and node 1030 from the second data source. The multiple shared-tree structure in FIG. 1, referred to herein as a schema graph, is a special form of a directed acyclic graph with an important characteristic that there is at most a single directed path between any two nodes. For example, there is only one directed path from node 1015 to node 1035 and this path passes through node 1020.
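The defining property of a schema graph, that there is at most one directed path between any two nodes, can be checked with a short sketch. The adjacency lists below reuse the node numerals mentioned for FIG. 1, but the two data-source root nodes and the exact edges are assumptions made for illustration, not a transcription of the figure.

```python
def count_paths(graph, src, dst):
    """Count directed paths from src to dst in an acyclic graph given as {node: [children]}."""
    if src == dst:
        return 1
    return sum(count_paths(graph, child, dst) for child in graph.get(src, []))

def is_schema_graph(graph):
    """True if no ordered pair of distinct nodes is joined by more than one directed path."""
    nodes = set(graph) | {c for children in graph.values() for c in children}
    return all(count_paths(graph, a, b) <= 1 for a in nodes for b in nodes if a != b)

graph = {
    "source1_root": ["1020", "1025"],  # assumed root of the first data source
    "source2_root": ["1030"],          # assumed root of the second data source
    "1015": ["1020", "1025", "1030"],  # root of the data view spanning both sources
    "1020": ["1035"],
}

# Adding a direct edge 1015 -> 1035 creates a second path and breaks the property.
broken = dict(graph)
broken["1015"] = graph["1015"] + ["1035"]

print(count_paths(graph, "1015", "1035"), is_schema_graph(graph), is_schema_graph(broken))
```

The recursive path count is exponential in the worst case and is only meant to make the property concrete on small examples.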

The schema graph is preferably constructed prior to a keyword searching operation and is made up of the initially disjoint individual tree-structured schemas of the hierarchical data sources. These schema trees are then joined when a data view is created that spans more than one data source. A data view typically comprises a query (such as an XQuery in an XML environment) and may be created by a database administrator or user.

In either case, the database system preferably logs or records these queries in its storage device. During the construction of the schema graph, a schema representation of each logged query is created and inserted into the schema graph. This results in a joining of two or more separate schema trees if the schema being inserted contains nodes from these trees, as illustrated in FIG. 1. It is also possible that the schema being inserted into the schema graph contains nodes from only one data source, in which case a joining of separate schema trees does not arise. Instead the insertion operation simply results in new nodes being added and linked to existing nodes from a single schema tree in the schema graph.
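A minimal sketch of this insertion step follows. It simplifies a logged query's schema to a single new root node linked directly to the existing nodes the query draws on, and it reports how many data sources the new view spans. The function `insert_view`, the node names and the two data sources "A" and "B" are all hypothetical.

```python
def insert_view(graph, source_of, view_root, referenced):
    """Insert a logged view's schema: add its root and link it to existing nodes.

    Returns the set of data sources the view spans; more than one means that
    previously separate schema trees are now joined through shared nodes.
    """
    unknown = [n for n in referenced if n not in source_of]
    if unknown:
        raise KeyError(f"nodes not in schema graph: {unknown}")
    graph.setdefault(view_root, set()).update(referenced)
    return {source_of[n] for n in referenced}

# Two initially disjoint schema trees, stored as {parent: set of children}.
graph = {
    "orders": {"order"}, "order": {"customer", "status"},
    "catalogue": {"product"}, "product": {"stock"},
}
# Which data source each existing node belongs to.
source_of = {
    "orders": "A", "order": "A", "customer": "A", "status": "A",
    "catalogue": "B", "product": "B", "stock": "B",
}

spanning = insert_view(graph, source_of, "order_stock_view", ["status", "stock"])
single = insert_view(graph, source_of, "order_summary_view", ["customer", "status"])
print(sorted(spanning), sorted(single))
```

The first view joins the two trees because it shares nodes with both; the second only adds a new node linked into a single tree.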

The schema graph may be updated continually as new queries are logged,or it may be updated on one or more occasions after new queries and dataviews have been collected over some period of time. Regardless of howoften the schema graph is updated, when a keyword search operation isinitiated, the schema graph current at the time of the operation is theone that is used to determine data views that are returned to the user.For the remainder of this document, the term “schema graph” refers tothe schema graph that is current at the time a keyword search operationis performed.

As in the case of a single data source, keyword searching withinmultiple hierarchical data sources involves first identifying nodeswithin the schema graph where the search keywords are found, referred toas “hit” nodes, and then identifying nodes that are relevant to the hitnodes, referred to as “context” nodes. A data structure comprising thehit and context nodes is then constructed and presented to the user.Since hit nodes can be located in more than one data source, theresulting data structure presented to the user may span multiple datasources. The resulting data structure is preferably also tree-structuredsince its intended applications are in hierarchical data environments.

FIG. 4 shows a preferred configuration 4000 and generalised mode ofoperation of the keyword searching methods. The configuration 4000comprises a PC client 4005, a data server 4010, a database 4015, akeyword search client 4025, and an index server 4030 connected togetherin a network.

Each of the devices 4005, 4010, 4025 and 4030 is typically formed by acorresponding general purpose computer system, such as the system 3700,each linked by the network 3720, which is only illustrated conceptuallyin FIG. 4. This conceptual illustration is used to provide for anuncluttered representation of data flows between the various devices4005, 4010, 4025 and 4030, and which occur across the network 3720. Whennecessary, appropriate or convenient, the various devices 4005, 4010,4025 and 4030 may be combined into a smaller number of distinct computersystems 3700. For example, in some implementations, it may be convenientto combine the servers 4010 and 4030 into one computer system 3700, andcombine the clients 4005 and 4025 into another computer system 3700,those systems 3700 being linked by the network 3720.

Data stored in the database 4015 is typically accessed by a user browsing at the PC client 4005. A browsing application, operating in the client 4005, issues commands preferably in the form of XQueries 4006 which are then transmitted to the data server 4010. Each XQuery 4006 is recorded in a log 4020 and analysed by the data server 4010, after which the requested data 4007 is fetched from the database 4015 and delivered to the PC client user 4005. At some point in time, preferably after a sufficient number of XQueries 4006 have been logged, the index server 4030 is activated and the logged XQueries 4020 are analysed to build an index table 4035. This process involves constructing a schema graph representation of the data stored in the database 4015 and its existing views represented by the logged XQueries 4020, building various frequency tables associated with these views, identifying searchable keywords in the database, determining one or more context trees and constructing a corresponding XQuery for each context tree, and finally recording these keywords and XQueries in an index table 4035 for later quick retrieval.

Once the building process of the index table 4035 completes, the system 4000 is ready to perform keyword search operations invoked at the keyword search client 4025. Search keywords 4026 entered by the user are transmitted to the index server 4030 where they are looked up against the index table 4035 and one or more XQueries 4031 are retrieved and presented to the user, appropriately ranked according to their relevance to the search keywords. When the user selects an XQuery 4027 from the list, the XQuery 4027 is transmitted by the keyword search client 4025 to the data server 4010 which responds with the appropriate data 4011.

The method 2000 of keyword searching involving one or more hierarchical data sources is summarised by the flowchart in FIG. 2. The method 2000 is preferably executed on the computer of the index server 4030. The method 2000 begins at step 2005, where hit nodes are identified in the schema graph. In an XML environment, there are potentially two ways in which a hit node can arise in the schema graph: (i) its element name may contain one of the search keywords or (ii) one or more XML nodes it represents may contain one of the search keywords. Subsequent to step 2005, step 2010 identifies context trees in the schema graph, each comprising nodes in the data sources represented by the hit and context nodes in the schema graph. Finally at step 2015, the identified context trees are converted to XQueries and presented to the user as a ranked list.

Methods for identifying context trees denoted by step 2010 in FIG. 2 are now described in detail. A method is first presented for the special case where there is a single hit node in the schema graph, followed later by a more general method that can handle cases involving more than one hit node. Both methods operate in two phases. The first is a bottom-up traversal of the schema graph from the hit nodes to determine which of their parents and ancestors are context nodes, from which the second phase proceeds in a top-down fashion to determine which of their descendants are also context nodes. The topmost ancestor of the hit nodes determined to be a context node then represents the root node of the context tree presented to the user as a result of the keyword search operation.

For the purpose of determining whether a node in the schema graph is a context node, preferably at least an occurrence frequency table and a co-occurrence frequency table are maintained. The former records the frequencies at which each node in the schema graph occurred in a logged query or data view whilst the latter records the frequencies at which pairs of nodes in the schema graph co-occur in the same logged query or data view. When the schema graph is updated with a new query or data view containing new nodes, new entries are added to the occurrence frequency table to represent the new nodes, and are each given an initial frequency value of 1 indicating that the nodes are new and have not previously been observed. Likewise, for each node-pair from the new query comprising two new nodes or a new node and an existing node, a new entry is added to the co-occurrence frequency table and given an initial frequency value of 1, whilst for each node-pair comprising a new node and an existing node not present in the new query, a new entry is added to the co-occurrence frequency table but is given an initial frequency value of 0.
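The two frequency tables described above may be sketched as follows. This is a minimal Python illustration, assuming each logged query or data view has already been reduced to the set of schema-graph node names it contains. The function name `build_frequency_tables` and the sample node names are hypothetical. A `Counter` returns 0 for any pair never seen together, which plays the role of the initial frequency value of 0 described above.

```python
from collections import Counter
from itertools import combinations


def build_frequency_tables(logged_queries):
    """Count node occurrences and pairwise co-occurrences.

    logged_queries: iterable of collections of node names, one per
    logged query or data view.
    Returns (occurrence, co_occurrence); co_occurrence is keyed by a
    frozenset holding the two node names of an unordered pair.
    """
    occurrence = Counter()
    co_occurrence = Counter()
    for query in logged_queries:
        nodes = sorted(set(query))      # ignore duplicates within one query
        occurrence.update(nodes)
        for a, b in combinations(nodes, 2):
            co_occurrence[frozenset((a, b))] += 1
    return occurrence, co_occurrence


# Hypothetical logged queries, each listed as the nodes it contains.
queries = [
    {"Company", "Manager", "Employee", "FirstName"},
    {"Company", "Manager"},
    {"Manager", "Employee", "FirstName", "LastName"},
]
occ, cooc = build_frequency_tables(queries)
manager_count = occ["Manager"]
manager_employee = cooc[frozenset(("Manager", "Employee"))]
company_lastname = cooc[frozenset(("Company", "LastName"))]
```

Here "Manager" occurs in all three queries, "Manager" and "Employee" co-occur in two, and "Company" and "LastName" never co-occur.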

As the schema graph is traversed, an occurrence probability is computed for each node, given the occurrences of the hit nodes. These conditional probability values are computed or approximated from values stored in the frequency tables due to previously logged queries and data views, and are used to determine whether a node is a context node.

The following is a description of the first method for the special case where there is a single hit node. Let this hit node be denoted by X.

In the first phase, a bottom-up traversal through the schema graph is made beginning at node X. Each of X's ancestors Y_(i) is considered in turn and its occurrence probability, given the occurrence of X, is computed:

$\Pr[Y_i \mid X] = \frac{\Pr[Y_i \wedge X]}{\Pr[X]} \approx \frac{\mathrm{freq}(Y_i, X)}{\mathrm{freq}(X)} \qquad \text{Eq. 1}$

where the probability value has been approximated by an occurrence frequency freq(X) and a co-occurrence frequency freq(Y_(i), X). The latter denotes the frequency that X and Y_(i) co-occur, where Y_(i) is an ancestor of X. Both are obtained directly from the occurrence and co-occurrence frequency tables stated earlier. From these probability values computed for the ancestor nodes of X it is possible to determine the probability that a particular ancestor Y_(i) is a root node, given X. Let Z₁, . . . Z_(n) denote the parent nodes of Y_(i), then

$\Pr[Y_i\ \mathrm{root} \mid X] = \Pr[\neg Z_1 \wedge \ldots \wedge \neg Z_n \wedge Y_i \mid X] \qquad \text{Eq. 2}$

That is, the probability that Y_(i) is root, given X, is the probability that Y_(i) is present and none of its parents are present, given X. Expanding the right hand side of Eq. 2 gives:

$\begin{aligned}\Pr[Y_i\ \mathrm{root} \mid X] &= \Pr[\neg Z_1 \wedge \ldots \wedge \neg Z_n \mid Y_i \wedge X]\,\Pr[Y_i \mid X] \\ &= \left(1 - \Pr[Z_1 \vee \ldots \vee Z_n \mid Y_i \wedge X]\right)\Pr[Y_i \mid X]\end{aligned} \qquad \text{Eq. 3}$

Since Z₁, . . . Z_(n) are mutually exclusive given Y_(i) (Y_(i) can have at most one parent in any actual hierarchical data structure),

$\begin{aligned}\Pr[Y_i\ \mathrm{root} \mid X] &= \left(1 - \sum_{j} \Pr[Z_j \mid Y_i \wedge X]\right)\Pr[Y_i \mid X] \\ &= \Pr[Y_i \mid X] - \sum_{j} \Pr[Z_j \mid Y_i \wedge X]\,\Pr[Y_i \mid X] \\ &= \Pr[Y_i \mid X] - \sum_{j} \Pr[Z_j \wedge Y_i \mid X]\end{aligned} \qquad \text{Eq. 4}$

But since there is at most one directed path between Z_(j) and X (a characteristic of the schema graph), it follows that this path must include Y_(i), and hence:

$\Pr[Z_j \wedge Y_i \wedge X] = \Pr[Z_j \wedge X] \;\Leftrightarrow \qquad \text{Eq. 5}$

$\Pr[Z_j \wedge Y_i \mid X] = \Pr[Z_j \mid X] \;\Leftrightarrow \qquad \text{Eq. 6}$

$\Pr[Y_i\ \mathrm{root} \mid X] = \Pr[Y_i \mid X] - \sum_{j} \Pr[Z_j \mid X] \qquad \text{Eq. 7}$

In a preferred implementation, a number of alternative context trees are returned to the user as results of a keyword search operation, one for each ancestor node Y_(i) of X whose associated probability Pr[Y_(i)root|X] is greater than zero. These alternative context trees are each assigned a score being the associated probability Pr[Y_(i)root|X] and sorted according to these scores, from highest to lowest. Context trees with higher scores are considered to be of more interest to the user than those with lower scores. In an alternative implementation, only the context tree with the highest score is presented to the user as the result of the keyword search operation.
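The scoring and ranking of candidate root nodes may be illustrated with a short Python sketch. It combines Eq. 1 and Eq. 7 into a single frequency ratio and assumes the frequency tables are plain dictionaries, with co-occurrence entries keyed by a frozenset of the two node names. The function name `rank_root_candidates`, the table layout and the sample frequencies are assumptions made for illustration only.

```python
def rank_root_candidates(hit, ancestors, parents, occ, cooc):
    """Rank ancestors of the hit node by the estimate of Pr[Y root | X].

    hit: the hit node X.
    ancestors: iterable of ancestor nodes Y of X.
    parents: dict mapping a node to a list of its parents in the schema graph.
    occ: dict of occurrence frequencies.
    cooc: dict of co-occurrence frequencies keyed by frozenset((a, b)).
    """
    def freq(a, b):
        return cooc.get(frozenset((a, b)), 0)

    scored = []
    for y in ancestors:
        # Eq. 7 with each term approximated as in Eq. 1
        numerator = freq(y, hit) - sum(freq(z, hit) for z in parents.get(y, []))
        score = numerator / occ[hit]
        if score > 0:                      # keep only possible roots
            scored.append((y, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical chain Company -> Manager -> Employee -> FirstName (hit node).
parents = {"Employee": ["Manager"], "Manager": ["Company"], "Company": []}
occ = {"FirstName": 8}
cooc = {
    frozenset(("Employee", "FirstName")): 8,
    frozenset(("Manager", "FirstName")): 4,
    frozenset(("Company", "FirstName")): 1,
}
ranked = rank_root_candidates(
    "FirstName", ["Company", "Manager", "Employee"], parents, occ, cooc
)
```

With these assumed frequencies, "Employee" scores (8 − 4)/8 = 0.5, "Manager" scores (4 − 1)/8 = 0.375 and "Company" scores 1/8 = 0.125, so the context tree rooted at "Employee" would be listed first.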

For each ancestor node Y_(i) that can serve as the root node of a context tree (ie. whose Pr[Y_(i)root|X]>0), a second phase, top-down traversal from Y_(i) is performed to determine which of its descendants (except the hit node X) are context nodes. For each parent node P_(j) visited during this phase, an analysis is performed to determine which of its children are context nodes. For each child node determined to be a context node, its children are in turn analysed in a top-down fashion to identify context nodes among them.

There are two distinct scenarios in the analysis of a parent node P_(j), as illustrated in FIGS. 3A and 3B. The first is a special case shown in FIG. 3A where P_(j) lies along the path from the root node Y_(i) 3005 to the hit node X 3020. This includes the case P_(j)=Y_(i) but excludes the case P_(j)=X. In this scenario, at least one child node of P_(j) 3010, in this case the child node C₁ 3015 that lies along the path from P_(j) to X, must be identified as a context node. In the more general second scenario, encompassing all remaining cases as shown in FIG. 3B, the parent node P_(j) 3030 does not lie along a directed path from the root node Y_(i) to the hit node X 3035 and thus it is not compulsory to identify any child nodes of P_(j) as context nodes. An algorithm for handling the second scenario will be presented first.

For a given hit node X and a specific node Y_(i) to serve as the root node of a context tree, the choice of whether some child node C_(k) of a parent node P_(j) is to be identified as a context node is in general a function of the probability that C_(k) occurs given the presence of all nodes along the directed path from Y_(i) to X:

$\Pr[C_k \mid X \wedge \ldots \wedge Y_i\ \mathrm{root} \wedge \ldots \wedge P_j]$

Since the evaluation or estimation of this probability is not possible with just the occurrence and co-occurrence frequency tables mentioned earlier, some form of simplification or approximation is needed. One such simplification preferably adopted is to ignore the effects of all nodes other than those from Y_(i) to C_(k) in the above probability expression, resulting in the expression:

$\Pr[C_k \mid Y_i\ \mathrm{root} \wedge \ldots \wedge P_j]$

where

$\begin{aligned}\Pr[C_k \mid Y_i\ \mathrm{root} \wedge \ldots \wedge P_j] &= \Pr[C_k \mid Y_i\ \mathrm{root} \wedge P_j] \\ &= \frac{\Pr[C_k \wedge Y_i\ \mathrm{root} \wedge P_j]}{\Pr[Y_i\ \mathrm{root} \wedge P_j]} \\ &= \frac{\Pr[C_k \wedge Y_i\ \mathrm{root}]}{\Pr[Y_i\ \mathrm{root} \wedge P_j]}\end{aligned} \qquad \text{Eq. 8}$

Let Z_(l), l=1, . . . , p denote the parent nodes of Y_(i), then the right hand side of Eq. 8 can be expanded to

$\begin{matrix}{\frac{{\Pr\left\lbrack {C_{k}\bigwedge Y_{i}} \right\rbrack} - {\sum\limits_{l = 1}^{p}{\Pr\left\lbrack {C_{k}\bigwedge Z_{l}} \right\rbrack}}}{{\Pr\left\lbrack {P_{j}\bigwedge Y_{i}} \right\rbrack} - {\sum\limits_{l = 1}^{p}{\Pr\left\lbrack {P_{j}\bigwedge Z_{l}} \right\rbrack}}} \approx \frac{{{freq}\left( {Y_{i},C_{k}} \right)} - {\sum\limits_{l = 1}^{p}{{freq}\left( {Z_{l},C_{k}} \right)}}}{{{freq}\left( {Y_{i},P_{j}} \right)} - {\sum\limits_{l = 1}^{p}{{freq}\left( {Z_{l},P_{j}} \right)}}}} & {{Eq}.\mspace{14mu} 9}\end{matrix}$
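The frequency-based estimate on the right hand side of Eq. 9 may be sketched in Python as follows. The function name `estimate_child_probability` and the dictionary-based table layout are hypothetical. The fallback to the occurrence table when a node is paired with itself (needed when P_(j) is the root Y_(i)), and the return value of 0.0 when the denominator is not positive, are assumptions of this sketch rather than steps stated in the description.

```python
def estimate_child_probability(child, parent, root, root_parents, occ, cooc):
    """Estimate Pr[C_k | Y_i root ^ P_j] from frequency tables (Eq. 9).

    root_parents: the parent nodes Z_1..Z_p of the root Y_i.
    occ: dict of occurrence frequencies.
    cooc: dict of co-occurrence frequencies keyed by frozenset((a, b)).
    """
    def freq(a, b):
        if a == b:
            # assumption: a node paired with itself uses its occurrence count
            return occ.get(a, 0)
        return cooc.get(frozenset((a, b)), 0)

    numerator = freq(root, child) - sum(freq(z, child) for z in root_parents)
    denominator = freq(root, parent) - sum(freq(z, parent) for z in root_parents)
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical frequencies: root "Manager" has a single parent "Company".
occ = {"Manager": 12, "Employee": 10}
cooc = {
    frozenset(("Manager", "LastName")): 6,
    frozenset(("Company", "LastName")): 2,
    frozenset(("Manager", "Employee")): 10,
    frozenset(("Company", "Employee")): 2,
    frozenset(("Company", "Manager")): 4,
}
q_lastname = estimate_child_probability(
    "LastName", "Employee", "Manager", ["Company"], occ, cooc
)
q_employee = estimate_child_probability(
    "Employee", "Manager", "Manager", ["Company"], occ, cooc
)
```

For the assumed values, the estimate for "LastName" under parent "Employee" is (6 − 2)/(10 − 2) = 0.5, and the estimate for "Employee" when the parent is the root itself is (10 − 2)/(12 − 4) = 1.0.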

The above expression however, only deals with each individual child node C_(k) in isolation. Unless C_(k) are independent from one another (given Y_(i) root), it is necessary to consider their joint probabilities. This however would require maintaining frequency tables storing the joint-occurrence frequencies of a large number of combinations of nodes, many of which would rarely be observed and hence it would not be possible to reliably estimate their joint probabilities from their joint-occurrence frequencies. On the other hand, assuming independence among C_(k) (given Y_(i) root) may lead to undesirable results, such as none of the child nodes C_(k) being selected if their individual occurrence probabilities (given Y_(i) root) are low.

In order to avoid the undesirable effects of independence assumptions among C_(k), whilst at the same time avoiding the need to maintain a large number of joint-occurrence frequency values, a heuristic method 5000 depicted by the flowchart in FIG. 5 may be used for selecting child nodes as context nodes. The method 5000 preferably operates as a subprogram within the method 2000 upon the server 4030.

The method 5000 begins at step 5005 where the occurrence probability of each child node C_(k) given the root node Y_(i), denoted by Q_(k)=Pr[C_(k)|Y_(i)root^P_(j)], is computed using Eq. 8 and Eq. 9. At the next step 5010, the probabilities Q_(k) are summed over all child nodes C_(k), the sum being denoted by T. The method 5000 continues at step 5015 where those nodes C_(k) with the highest probability value are selected as context nodes. If more than one child node exists with the same highest probability value then all such nodes are selected as context nodes. The sum S of the probabilities of all child nodes so far selected as context nodes is then computed at step 5020. Execution then proceeds to the decision step 5025, at which point if all child nodes C_(k) have been selected as context nodes then the method terminates at step 5040. If however there are one or more child nodes C_(k) not yet selected as context nodes then the method 5000 continues to another decision step 5030. At step 5030 a check is made to ascertain whether S≥T/2 and if so the method again terminates at step 5040. If S<T/2 then execution proceeds to step 5035. Here the list of child nodes C_(k) not yet selected as context nodes is examined to identify those with the highest probability value among themselves. These are selected as context nodes and the method 5000 returns to step 5020 for further processing in the manner discussed above.
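The selection loop of the method 5000 may be sketched in Python as below. The sketch assumes the probabilities Q_(k) have already been computed and are supplied as a dictionary; the function name `select_context_children` and the sample child names are hypothetical. Step numbers in the comments refer to the flowchart steps named in the description above.

```python
def select_context_children(probabilities):
    """Select context nodes among child nodes, following method 5000.

    probabilities: dict mapping each child node C_k to its value Q_k.
    Returns the selected child nodes in order of selection.
    """
    total = sum(probabilities.values())               # step 5010: T
    remaining = dict(probabilities)
    selected = []
    while remaining:                                  # step 5025: stop when all chosen
        top = max(remaining.values())                 # steps 5015 and 5035
        tied = [c for c, q in remaining.items() if q == top]
        for child in tied:                            # ties are selected together
            selected.append(child)
            del remaining[child]
        s = sum(probabilities[c] for c in selected)   # step 5020: S
        if s >= total / 2:                            # step 5030
            break
    return selected


frequent_pair = select_context_children(
    {"FirstName": 0.9, "LastName": 0.9, "Phone": 0.3, "Fax": 0.1}
)
graded = select_context_children({"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1})
all_equal = select_context_children({"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25})
```

In the first call the two tied children already account for more than half of T, so only they are selected. In the second call "A" alone falls short of T/2, so "B" is added. In the third call every child ties for the highest value and all are selected.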

The method 5000 has a number of desirable properties:

-   If logged queries or data views intersect with sufficiently high frequencies (ie. with a relatively large number of child nodes in common), then the method 5000 tends to return their intersections as context nodes. This is likely to lead to an acceptable result since an intersection that is sufficiently large tends to carry sufficient context information (for the hit node).
-   If logged queries or data views have relatively few child nodes in common, then the resulting set of context nodes tends to comprise not only their intersections but also additional nodes. Experiments conducted by the present inventor show the resulting set of context child nodes tends to reflect that of the most frequent logged query or data view. This is significant since the intersections alone would not likely contain sufficient context information.
-   Due to the inclusion of child nodes with identically highest probability value as a whole, the method 5000 is biased towards identifying more rather than fewer nodes as context nodes. In the case where the sets of child nodes present in logged queries are mutually exclusive and occur with equal frequencies, the method identifies all child nodes as context nodes.

In the method 5000, when a parent node is identified as a context node, one or more of its children are always identified as context nodes as well. This may be undesirable if there are many logged queries or data views in which the parent node occurs without any of its children (ie. occurs as a leaf node). Intuitively, if this occurs sufficiently often then the parent node alone should be identified as a context node without any of its children to reflect the frequently observed behaviour.

To remedy this issue, a preferred implementation makes use of an additional leaf co-occurrence frequency table, generated and stored by the index server 4030. This table stores the frequency at which a node P_(j) co-occurs as a leaf node in past logged queries and data views with its ancestor Y_(i), for every possible pair of such nodes P_(j) and Y_(i), excluding those nodes P_(j) that have no children in the schema graph. This new frequency table is then used to estimate the probability that a node P_(j) occurs as a leaf node, given P_(j) and some root node Y_(i):

$\begin{aligned}\Pr[P_j\ \mathrm{leaf} \mid Y_i\ \mathrm{root} \wedge P_j] &= \frac{\Pr[P_j\ \mathrm{leaf} \wedge Y_i\ \mathrm{root}]}{\Pr[Y_i\ \mathrm{root} \wedge P_j]} \\ &\approx \frac{\mathrm{freq}(Y_i, P_j\ \mathrm{leaf}) - \sum_{l=1}^{p} \mathrm{freq}(Z_l, P_j\ \mathrm{leaf})}{\mathrm{freq}(Y_i, P_j) - \sum_{l=1}^{p} \mathrm{freq}(Z_l, P_j)}\end{aligned} \qquad \text{Eq. 10}$

where Z_(l), l=1, . . . , p denote the parent nodes of Y_(i) as defined earlier, and freq(Y_(i), P_(j) leaf) and freq(Z_(l), P_(j) leaf) are co-occurrence frequency values obtained from the new leaf co-occurrence frequency table.

The probability Pr[P_(j)leaf|Y_(i)root^P_(j)] is preferably determined in an additional decision step prior to the method 5000 given in FIG. 5 for identifying which child nodes of P_(j) are context nodes. If Pr[P_(j)leaf|Y_(i)root^P_(j)] is greater than 0.5, then no child nodes of P_(j) are selected as context nodes, otherwise the method 5000 is performed to identify which child nodes are context nodes.

An alternative implementation is also possible, and employs an alternative method 6000 whose flowchart is given in FIG. 6 for selecting context nodes among a set of child nodes C_(k), k=1, . . . , m. The method 6000, which is also performed by the index server 4030, begins at step 6001 where a fictitious child node C₀ is conceptually created and added to the list of actual child nodes C₁, . . . , C_(m) and is assigned a probability value Pr[P_(j)leaf|Y_(i)root^P_(j)] using Eq. 10. At the next step 6005, the actual child nodes C_(k) are assigned their usual probability values Q_(k)=Pr[C_(k)|Y_(i)root^P_(j)] using Eq. 8 and Eq. 9. The method 6000 then continues at step 6006 by invoking method 5000 at step 5010 (skipping step 5005) to select among the child nodes C₀, . . . , C_(m) a set of context nodes. When the method 5000 exits, the method 6000 resumes at decision step 6010 where a check is made to determine if the fictitious child node C₀ has been selected as a context node. If so then execution continues at step 6020 where C₀ is excluded as a context node. The method 6000 subsequently terminates at step 6015. If the test at 6010 fails, then the method 6000 proceeds directly to the termination step 6015.
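The fictitious child node approach of the method 6000 may be sketched in Python as follows. The helper `select_top_half` restates the selection loop of the method 5000 so that the block is self-contained. The label `LEAF`, both function names and the sample probabilities are hypothetical and serve only to illustrate the flow.

```python
LEAF = "<leaf>"  # hypothetical label standing in for the fictitious child C_0


def select_top_half(probabilities):
    """Selection loop of method 5000, entered at step 5010."""
    total = sum(probabilities.values())
    remaining = dict(probabilities)
    selected, running = [], 0.0
    while remaining:
        top = max(remaining.values())
        for node in [n for n, q in remaining.items() if q == top]:
            selected.append(node)
            running += remaining.pop(node)
        if running >= total / 2:
            break
    return selected


def select_children_with_leaf_option(child_probs, leaf_prob):
    """Method 6000 sketch: add C_0, run the selection, then drop C_0."""
    candidates = {LEAF: leaf_prob, **child_probs}      # steps 6001 and 6005
    chosen = select_top_half(candidates)               # step 6006
    return [node for node in chosen if node != LEAF]   # steps 6010 and 6020


children = {"FirstName": 0.2, "LastName": 0.2}
mostly_leaf = select_children_with_leaf_option(children, 0.8)
borderline = select_children_with_leaf_option(children, 0.3)
rarely_leaf = select_children_with_leaf_option(children, 0.1)
```

With a leaf probability of 0.8 the fictitious child alone reaches T/2, so no actual children are returned. With 0.3 it falls short and both children are added, and with 0.1 the children are selected ahead of it.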

The idea behind the alternative method 6000 for incorporating the possibility that none of P_(j)'s child nodes are context nodes is essentially identical to that of the first: no child nodes are selected when Pr[P_(j)leaf|Y_(i)root^P_(j)] is sufficiently large. However, the effects of Pr[P_(j)leaf|Y_(i)root^P_(j)] on the resulting set of context nodes are more gradual in this alternative approach, which is generally more favourable than the abrupt on/off behaviour of the first approach.

For the special scenario where the parent node P_(j) lies along the directed path from the root node Y_(i) to the hit node X, special considerations must be made to ensure that the child node of P_(j) that lies along the path from P_(j) to X is identified as a context node. Without loss of generality, let this child node be C₁ as illustrated in FIG. 3A as item 3015. Whilst the method 5000 presented earlier for the general scenario can be modified (for example by inflating the occurrence probability of C₁ above those of all other child nodes prior to step 5015), such an approach may not yield correct results. This is because the method 5000 as described has been devised to select a set of the most frequently occurring child nodes as context nodes given the root Y_(i) and parent P_(j). If this set does not naturally contain C₁, then it basically means that C₁ is not related to nodes in the set. Forcefully including C₁ would simply result in a set of child nodes that have little in common and provide little context for C₁ (and subsequently for X).

Instead of modifying the method 5000, a different but somewhat procedurally similar method 7000 illustrated in FIG. 7 is preferably adopted in another implementation. The difference between the new method 7000 and the previous method 5000 lies in the independence assumption used. Recall that the first simplification made in the general case where P_(j) does not lie along the directed path from Y_(i) to X was the assumption that

$\Pr[C_k \mid X \wedge \ldots \wedge Y_i\ \mathrm{root} \wedge \ldots \wedge P_j]$

is independent of nodes other than those from Y_(i) to P_(j). In the current scenario where one child node C₁ of P_(j) lies along the path from X to P_(j), it would not be sensible to assume that C_(k) is independent of nodes from P_(j) to X (including C₁) as they are necessary ancestors of X that link the hit keyword X to C_(k). Since some simplifications are necessary to keep the problem tractable, it follows that a better choice is to assume independence between C_(k) and its ancestors above P_(j) towards the root node Y_(i). With this assumption, the probabilities of interest are

$\Pr[C_k \mid X \wedge \ldots \wedge C_1 \wedge P_j], \quad k \neq 1$

Again, since there is at most one directed path linking X and P_(j), the above expression is equivalent to

$\begin{matrix}{{\Pr\left\lbrack {C_{k}❘{X\bigwedge P_{j}}} \right\rbrack} = \frac{\Pr\left\lbrack {C_{k}\bigwedge X\bigwedge P_{j}} \right\rbrack}{\Pr\left\lbrack {X\bigwedge P_{j}} \right\rbrack}} & {{Eq}.\mspace{14mu} 11}\end{matrix}$

The numerator on the right hand side of Eq. 11 cannot be obtained from the occurrence and co-occurrence frequency tables so far mentioned, since it involves three rather than two nodes. An extra joint-occurrence frequency table between 3-tuples of nodes is therefore required. Fortunately, as each of these 3-tuples comprises a pair of parent-child nodes C_(k) and P_(j) (rather than any arbitrary pair of nodes), and since each node C_(k) in practice has only a small number of parents, the new joint-occurrence frequency table would only be slightly larger than a co-occurrence frequency table involving pairs of nodes.

With the new joint-occurrence frequency table, Pr[C_(k)|X^P_(j)] can be estimated as

$\Pr[C_k \mid X \wedge P_j] \approx \frac{\mathrm{freq}(C_k, P_j, X)}{\mathrm{freq}(P_j, X)} \qquad \text{Eq. 12}$

where freq(C_(k), P_(j), X) denotes the joint-occurrence frequency between nodes C_(k), P_(j) and X, P_(j) is a parent of C_(k) and an ancestor of X, and C_(k) is neither X nor an ancestor of X.

The method 7000 for determining the set of siblings of C₁ to be included with C₁ as context nodes is very similar to method 5000 already described. The method 7000 begins at step 7001 where the occurrence probability of each child node C_(k)≠C₁ given the parent node P_(j) and the hit node X, denoted by Q_(k)=Pr[C_(k)|X^P_(j)], is computed using Eq. 12. At the next step 7005, the probabilities Q_(k) are summed over all child nodes C_(k)≠C₁, the sum being denoted by T. The method 7000 continues at step 7010 where node C₁ is selected as a context node, and then subsequently at step 7015 where those nodes C_(k)≠C₁ with the highest probability value are also selected as context nodes. If more than one child node exists with the same highest probability value then all such nodes are selected as context nodes. The sum of the probabilities of all child nodes so far selected as context nodes excluding C₁ is then computed at step 7020, the sum being denoted by S. Execution then proceeds to the decision step 7025, at which point if all child nodes C_(k) have been selected as context nodes then the method 7000 terminates at step 7040. If however there are one or more child nodes C_(k) not yet selected as context nodes then the method continues to another decision step 7030. At step 7030 a check is made to ascertain whether S≥T/2 and, if so, the method 7000 again terminates at step 7040. If S<T/2 then execution proceeds to step 7035. Here the list of child nodes C_(k)≠C₁ not yet selected as context nodes is examined to identify those with the highest probability value among themselves. These are selected as context nodes and the method returns to step 7020 for further processing.
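A Python sketch of the method 7000 is given below. It assumes that the joint-occurrence frequencies freq(C_(k), P_(j), X) for the children of P_(j) are supplied as a dictionary together with the pair frequency freq(P_(j), X), which is taken to be positive. The function name `select_siblings_on_hit_path` and the sample figures are hypothetical.

```python
def select_siblings_on_hit_path(path_child, joint_freq, pair_freq):
    """Select C_1 and those of its siblings that are context nodes.

    path_child: the child C_1 of P_j lying on the path from P_j to X.
    joint_freq: dict mapping each child C_k to freq(C_k, P_j, X).
    pair_freq: freq(P_j, X), assumed to be greater than zero.
    """
    # Step 7001: Eq. 12 for every child other than C_1
    q = {c: f / pair_freq for c, f in joint_freq.items() if c != path_child}
    total = sum(q.values())                   # step 7005: T
    selected = [path_child]                   # step 7010: C_1 is compulsory
    remaining = dict(q)
    s = 0.0                                   # S excludes C_1
    while remaining:                          # step 7025
        top = max(remaining.values())         # steps 7015 and 7035
        for child in [c for c, v in remaining.items() if v == top]:
            selected.append(child)
            s += remaining.pop(child)         # step 7020
        if s >= total / 2:                    # step 7030
            break
    return selected


dominant = select_siblings_on_hit_path(
    "FirstName", {"LastName": 8, "Phone": 2, "Fax": 1}, 10
)
tied = select_siblings_on_hit_path(
    "FirstName", {"LastName": 3, "Phone": 3, "Fax": 2}, 10
)
```

In the first call "LastName" alone exceeds T/2, so the result is C₁ and "LastName". In the second call "LastName" and "Phone" tie for the highest value and are selected together.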

Some modifications are needed to method 7000 to allow for cases where no siblings of C₁ are included in the solution. This is achieved by introducing a sole child co-occurrence frequency table that stores the frequency that a node P_(j) co-occurs with one of its descendants X such that only one child node of P_(j) (C₁ along the path from P_(j) to X) is present in past logged queries and data views. This frequency table is then used to estimate the probability that C₁ has no sibling given its parent P_(j) and the hit node X:

$\begin{aligned}\Pr[C_1\ \text{no sibling} \mid P_j \wedge X] &= \Pr[C_1 \wedge \neg C_k\ \forall k \neq 1 \mid P_j \wedge X] \\ &= \Pr[P_j\ \text{has 1 child} \mid P_j \wedge X] \\ &= \frac{\Pr[P_j\ \text{has 1 child} \wedge P_j \wedge X]}{\Pr[P_j \wedge X]} \\ &= \frac{\Pr[P_j\ \text{has 1 child} \wedge X]}{\Pr[P_j \wedge X]} \\ &\approx \frac{\mathrm{freq}(P_j\ \text{has 1 child}, X)}{\mathrm{freq}(P_j, X)}\end{aligned} \qquad \text{Eq. 13}$

where freq(P_(j) has 1 child, X) denotes the frequency at which node P_(j) co-occurs with its descendant X and P_(j) has a single child node (C₁), and is obtained from the new frequency table.

In one implementation, the probability Pr[C₁ no sibling|P_(j)^X] is used in an additional decision step prior to the method 7000 given in FIG. 7 for identifying which child nodes of P_(j) are context nodes. If Pr[C₁ no sibling|P_(j)^X] is greater than 0.5, then no child nodes of P_(j) other than C₁ are selected as context nodes, otherwise method 7000 is performed to identify which child nodes are context nodes.

An alternative implementation is also possible. This employs an alternative method 8000 whose flowchart is given in FIG. 8 for selecting context nodes among a set of child nodes C_(k), k=1, . . . , m. The method 8000 begins at step 8001 where a fictitious child node C₀ is conceptually created and added to the list of actual child nodes C₁, . . . , C_(m) and is assigned a probability value Q₀=Pr[C₁ no sibling|P_(j)^X] using Eq. 13. At the next step 8005, the actual child nodes C_(k) except C₁ are assigned their usual probability values Q_(k)=Pr[C_(k)|X^P_(j)] using Eq. 11 and Eq. 12. The method 8000 then continues at step 8006 by invoking method 7000 at step 7005 (skipping step 7001) to select among the child nodes C₀, . . . , C_(m) a set of context nodes. When method 7000 exits, method 8000 resumes at decision step 8010 where a check is made to determine if the fictitious child node C₀ has been selected as a context node. If so then execution continues at step 8020 where C₀ is excluded as a context node. The method 8000 subsequently terminates at step 8015. If the test at 8010 fails, then the method proceeds directly to the termination step 8015.

The preceding discussion describes two distinct methods 6000 and 8000 for determining from a set of child nodes which are context nodes. Preferably the latter is applied in the scenario where the parent node P_(j) lies along the directed path from the root node Y_(i) to the hit node X, whilst the former is used for all other parent nodes. In an alternative implementation, the first method 6000 is employed even for the case where P_(j) lies along the path from Y_(i) to X. If this results in a set of context child nodes that includes C₁, then the set is adopted, otherwise the set is discarded and the second method 8000 is applied to determine a new set of context child nodes. The rationale behind this favouring of the first method is that the probability values computed there are conditional on the root node Y_(i), rather than on the hit node X. Tests conducted by the present inventor seem to suggest that the root node of a data view tends to be a better indicator of what nodes are present in the view.

The keyword searching system 4000 disclosed herein is a form of a learning system. From a set of logged queries and existing data views, which are akin to training examples, the system is able to synthesise new views of data. If patterns exist in the logged queries or data views, then they will be reflected in the frequency tables which in turn will affect the behaviour of the system 4000. A desirable feature for any learning system is an ability to make some form of generalisation that allows it to use patterns learned from one set of problems to improve its performance when handling related but yet unseen problems. One aspect of generalisation that is important in a hierarchical environment is the ability to observe occurrence patterns of certain sub-structures of data and generalise them to other similar or identical sub-structures.

Consider the data structure 9000 shown in FIG. 9, in which there are two identical “Employee” sub-structures 9010 and 9030 (enclosed within the dotted curves), one under “Manager” 9005 and the other under “Project Members” 9025. Suppose that in all logged queries and data views, the sub-elements “FirstName” 9015 and “LastName” 9020 in the first “Employee” sub-tree have always been observed to appear together, whilst no queries or data views containing the second “Employee” sub-tree 9030 have yet been observed. Suppose further that a keyword search operation for an employee's name is invoked in which a “hit” is found in the “FirstName” sub-element 9035 of the second “Employee” sub-tree 9030, making 9035 the hit node. Even though no example queries or views have been encountered with this sub-element present, it is intuitively apparent from the occurrence patterns observed for the first “Employee” sub-tree 9010 that the sub-element “LastName” 9040 in the second “Employee” sub-tree 9030 should be identified as a context node.

Such a generalisation ability is particularly important when working with XML data since identical data sub-structures often exist at several locations in a data hierarchy (for example, as a result of the use of referenced schema elements). Such generalisation may be realised through probability averaging. Probability averaging works by appropriately averaging the occurrence probabilities of nodes in the schema graph that have identical names or IDs or labels. The application of probability averaging is now described firstly for the first bottom-up phase of the construction of the context tree, and then subsequently for the second top-down phase.

Recall that the operation of the first phase relies on the probability values Pr[Y_(i)|X], where Y_(i) are ancestors of the hit node X. To facilitate probability averaging, Pr[Y_(i)|X] is preferably first reformulated into an incremental form, as follows: Let W be a child of Y_(i) that lies along the one and only directed path from Y_(i) to X. Pr[Y_(i)|X] can then be rewritten as

$\begin{aligned} \Pr[Y_i \mid X] &= \frac{\Pr[Y_i \wedge X]}{\Pr[X]} \\ &= \frac{\Pr[Y_i \wedge W \wedge X]}{\Pr[X]} \quad \text{(the path from } Y_i \text{ to } X \text{ must include } W\text{)} \\ &= \frac{\Pr[Y_i \mid W \wedge X]\,\Pr[W \wedge X]}{\Pr[X]} \\ &= \Pr[Y_i \mid W \wedge X]\,\Pr[W \mid X] \end{aligned} \qquad \text{(Eq. 14)}$

That is, Pr[Y_(i)|X] can be incrementally obtained from the probability value of its child node W, namely Pr[W|X]. The idea is to begin the procedure at the hit node X and make use of the above expression to obtain probability values for successively higher ancestor nodes. At each step, the method of probability averaging is then applied to the first term on the right hand side of Eq. 14. Thus, let Pr′[B|X] denote the modified probability value of some node B as a result of probability averaging, then Pr′[Y_(i)|X] can be defined by the following recursive formulae:

$\Pr'[X \mid X] = 1 \qquad \text{(Eq. 15)}$

$\Pr'[Y_i \mid X] = \begin{cases} 0 & \text{if } \Pr'[W \mid X] = 0 \\ \Pr_{mean}[Y_i \mid W \wedge X]\,\Pr'[W \mid X] & \text{otherwise} \end{cases} \qquad \text{(Eq. 16)}$

where

$\begin{aligned} \Pr_{mean}[Y_i \mid W \wedge X] &= \frac{\sum_k \Pr[Y_{ik} \mid W_k \wedge X_k]\,\Pr[W_k \wedge X_k]}{\sum_k \Pr[W_k \wedge X_k]} \\ &= \frac{\sum_k \Pr[Y_{ik} \wedge W_k \wedge X_k]}{\sum_k \Pr[W_k \wedge X_k]} \\ &= \frac{\sum_k \Pr[Y_{ik} \wedge X_k]}{\sum_k \Pr[W_k \wedge X_k]} \quad \text{(the path from } Y_{ik} \text{ to } X_k \text{ must include } W_k\text{)} \\ &\approx \frac{\sum_k \mathrm{freq}(Y_{ik}, X_k)}{\sum_k \mathrm{freq}(W_k, X_k)} \end{aligned} \qquad \text{(Eq. 17)}$

and Pr_(mean)[Y_(i)|W^X] denotes the weighted average or mean probability of Y_(i) given W and X computed over all pairs of nodes (Y_(ik), X_(k)) (for some values of k) that are equivalent to (Y_(i), X), with X₀ and Y_(i0) (ie. k=0) being aliases for X and Y_(i) respectively. For each of these equivalent pairs (Y_(ik), X_(k)), the term W_(k) in the summations denotes the immediate child of Y_(ik) lying along the directed path from Y_(ik) to X_(k).
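The recursion of Eq. 15 to Eq. 17 can be illustrated with a minimal Python sketch. It is not the implementation of the specification: the node identifiers, the frequency table `freq`, and the mapping `equiv` of each node pair to its equivalent pairs are hypothetical and assumed to be supplied by the caller. The sketch also assumes that `freq[(X, X)]` holds the number of logged queries and data views containing X.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (ancestor node, hit node); all identifiers are hypothetical


def pr_mean(freq: Dict[Pair, int], y_pairs: List[Pair], w_pairs: List[Pair]) -> float:
    """Eq. 17: pooled freq(Y_ik, X_k) divided by pooled freq(W_k, X_k)."""
    numerator = sum(freq.get(p, 0) for p in y_pairs)
    denominator = sum(freq.get(p, 0) for p in w_pairs)
    if denominator == 0:
        raise ZeroDivisionError("Eq. 17 is undefined: no (W_k, X_k) pair was observed")
    return numerator / denominator


def averaged_probs(freq: Dict[Pair, int], path: List[str],
                   equiv: Dict[Pair, List[Pair]]) -> Dict[str, float]:
    """Apply Eq. 15 and Eq. 16 along path = [X, parent, grandparent, ...]."""
    x = path[0]
    probs = {x: 1.0}  # Eq. 15
    for w, y in zip(path, path[1:]):
        if probs[w] == 0.0:
            probs[y] = 0.0  # first branch of Eq. 16
        else:
            probs[y] = pr_mean(freq, equiv[(y, x)], equiv[(w, x)]) * probs[w]
    return probs


# Two equivalent copies of one sub-structure; only the second copy was ever observed.
freq = {("X", "X"): 0, ("W", "X"): 0, ("Y", "X"): 0,
        ("X2", "X2"): 10, ("W2", "X2"): 8, ("Y2", "X2"): 4}
equiv = {("X", "X"): [("X", "X"), ("X2", "X2")],
         ("W", "X"): [("W", "X"), ("W2", "X2")],
         ("Y", "X"): [("Y", "X"), ("Y2", "X2")]}
probs = averaged_probs(freq, ["X", "W", "Y"], equiv)
```

In this example the unobserved copy inherits its probabilities from the observed one, which is the generalisation effect that probability averaging is intended to provide.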

A node pair (Y_(ik), X_(k)) is said to be equivalent to a node pair (Y_(i), X) if

-   (i) Y_(ik) has the same name or label or ID as Y_(i) and X_(k) has the same name, label or ID as X,
-   (ii) there is a direct ancestor-descendant relationship between Y_(ik) and X_(k) and similarly between Y_(i) and X,
-   (iii) for each node W_(k) along the directed path from Y_(ik) to X_(k), there must exist a corresponding node W along the directed path from Y_(i) to X such that (W_(k), X_(k)) is equivalent to (W, X) and (Y_(i), W_(k)) is equivalent to (Y_(i), W),
-   (iv) Y_(i) and Y_(ik) have exactly the same number of parents and for each parent Z_(j) of Y_(i), there exists a parent Z_(kj) of Y_(ik) such that (Z_(kj), Y_(ik)) and (Z_(j), Y_(i)) satisfy conditions (i) to (iii) above.

The modified probability that Y_(i) is root given X due to probability averaging is then given by

$\Pr'[Y_i\ \text{root} \mid X] = \Pr'[Y_i \mid X] - \sum_j \Pr'[Z_j \mid X] \qquad \text{(Eq. 18)}$

where

$\Pr'[Z_j \mid X] = \Pr_{mean}[Z_j \mid Y_i \wedge X]\,\Pr'[Y_i \mid X] \qquad \text{(Eq. 19)}$

as obtained from Eq. 16 by replacing Y_(i) with Z_(j) and W with Y_(i).

In the event that the denominator on the right hand side of Eq. 17 is zero, indicating that none of the node pairs (W_(k), X_(k)) has been observed in logged queries and data views, Eq. 17 and hence Eq. 19 and Eq. 18 are undefined and consequently some alternative method for identifying context nodes is needed. A preferred approach is to alternatively define Pr_(mean)[Z_(j)|Y_(i)^X] in terms of the distance of Z_(j) from the hit node X as follows:

$\Pr_{mean}[Z_j \mid Y_i \wedge X] = \begin{cases} \dfrac{\sum_k \mathrm{freq}(Z_{jk}, X_k)}{\sum_k \mathrm{freq}(Y_{ik}, X_k)} & \text{if } \sum_k \mathrm{freq}(Y_{ik}, X_k) \neq 0 \\ 1 & \text{if } \sum_k \mathrm{freq}(Y_{ik}, X_k) = 0 \text{ and } \mathrm{dist}(Z_j, X) \leq d_{\max} \\ 0 & \text{if } \sum_k \mathrm{freq}(Y_{ik}, X_k) = 0 \text{ and } \mathrm{dist}(Z_j, X) > d_{\max} \end{cases} \qquad \text{(Eq. 20)}$

where d_(max) is some threshold constant, and dist(A, B) is the distance between two nodes A and B in the schema graph, defined as the number of links along the path between A and B. In the absence of relevant past logged queries and data views, the distance between two nodes should give a good indication of how they are related to one another since in practice related data are usually stored in proximity of each other.
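As an illustration of Eq. 20, the following Python sketch computes the distance-based fallback. The function names, the adjacency-list representation of the schema graph, and the use of a breadth-first search (which returns the shortest path when several paths exist) are assumptions made for the example only.

```python
from collections import deque
from typing import Dict, List


def dist(graph: Dict[str, List[str]], a: str, b: str) -> int:
    """Number of links on the shortest path between a and b (breadth-first search)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    raise ValueError("nodes are not connected")


def pr_mean_eq20(sum_freq_z: int, sum_freq_y: int, dist_z_x: int, d_max: int) -> float:
    """Eq. 20: frequency ratio when defined, otherwise a 0/1 value from the distance."""
    if sum_freq_y != 0:
        return sum_freq_z / sum_freq_y
    return 1.0 if dist_z_x <= d_max else 0.0


# Hypothetical undirected schema graph: Z - Y - X
graph = {"Z": ["Y"], "Y": ["Z", "X"], "X": ["Y"]}
d_zx = dist(graph, "Z", "X")
observed = pr_mean_eq20(3, 12, d_zx, d_max=1)      # frequencies available
near = pr_mean_eq20(0, 0, d_zx, d_max=2)           # unseen, within d_max
far = pr_mean_eq20(0, 0, d_zx, d_max=1)            # unseen, beyond d_max
```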

A flowchart of a method 10000 for computing the probability that an ancestor node Y_(i) of a hit node X is the root node of a context tree with probability averaging, for all ancestor nodes Y_(i), is shown in FIG. 10. The method 10000 begins at step 10001 with Y_(i)=X and hence Pr′[Y_(i)|X]=1. At the next step 10005, Eq. 19 and Eq. 20 are used to compute Pr′[Z_(j)|X] for each parent node Z_(j) of Y_(i). Subsequent to step 10005, step 10010 computes Pr′[Y_(i) root|X] according to Eq. 18. Step 10025 then tests to determine whether all parent nodes of Y_(i) have been processed. If not, the method 10000 then proceeds to step 10015 where a parent node Z_(j) of Y_(i) is selected. Upon reaching step 10020, the method 10000 is recursively invoked at step 10005 (skipping step 10001) but with the selected parent node Z_(j) playing the role of Y_(i). When this invocation returns, the current execution of method 10000 resumes and returns to step 10025 to check for more parent nodes. When all parent nodes have been processed the method 10000 ends at step 10030.
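The recursive traversal described for method 10000 might be sketched in Python as below. The function name, the `parents` mapping, and the callable `pr_mean` (standing in for Eq. 20, with the hit node left implicit) are hypothetical. The sketch further assumes that each ancestor is reached along a single path, so that no node is visited twice.

```python
from typing import Callable, Dict, List, Optional


def root_probabilities(y: str, pr_y: float, parents: Dict[str, List[str]],
                       pr_mean: Callable[[str, str], float],
                       out: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Return Pr'[Y root | X] for y and every ancestor of y."""
    if out is None:
        out = {}
    # Eq. 19: Pr'[Z_j | X] for each parent Z_j of the current node
    pr_z = {z: pr_mean(z, y) * pr_y for z in parents.get(y, [])}
    # Eq. 18: probability that the current node is the root of the context tree
    out[y] = pr_y - sum(pr_z.values())
    # Recurse with each parent playing the role of Y_i
    for z, p in pr_z.items():
        root_probabilities(z, p, parents, pr_mean, out)
    return out


# Hypothetical chain X -> A -> B with assumed mean probabilities
parents = {"X": ["A"], "A": ["B"]}
table = {("A", "X"): 0.8, ("B", "A"): 0.5}
result = root_probabilities("X", 1.0, parents, lambda z, y: table[(z, y)])
```

The initial call passes the hit node with probability 1, mirroring step 10001; in this single-path example the resulting root probabilities sum to 1.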

Probability averaging is also applied to the second top-down traversal phase. In this phase, for the general case where a parent node P_(j) does not lie along a directed path from the root node Y_(i) to the hit node X, probability averaging can be applied in the same way as that used in the first phase. The selection of child nodes C_(k) of Y_(i) for inclusion in the keyword search result as context nodes is based on the probabilities Pr[C_(k)|Y_(i) root^P_(j)]. With probability averaging, the above expression is replaced by a mean probability

$\Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j] = \frac{\sum_h \Pr[C_{kh} \wedge Y_{ih}\ \text{root}]}{\sum_h \Pr[Y_{ih}\ \text{root} \wedge P_{jh}]} \qquad \text{(Eq. 21)}$

where (Y_(ih), C_(kh)) is equivalent to (Y_(i), C_(k)) and (P_(jh), C_(kh)) is equivalent to (P_(j), C_(k)), with Y_(i0), C_(k0) and P_(j0) (ie. h=0) being aliases for Y_(i), C_(k) and P_(j) respectively. Let Z_(j) denote the parents of Y_(i), and similarly Z_(jh) the corresponding parents of Y_(ih). The above expression can be expanded to

$\begin{matrix}\begin{matrix}{{\Pr_{mean}\left\lbrack {C_{k}❘{Y_{i}{{root}\bigwedge P_{j}}}} \right\rbrack} = \frac{\sum\limits_{h}\left\{ {{\Pr\left\lbrack {C_{kh}\bigwedge Y_{ih}} \right\rbrack} - {\sum\limits_{j}{\Pr\left\lbrack {C_{kh}\bigwedge Z_{jh}} \right\rbrack}}} \right\}}{\sum\limits_{h}\left\{ {{\Pr\left\lbrack {P_{jh}\bigwedge Y_{ih}} \right\rbrack} - {\sum\limits_{j}{\Pr\left\lbrack {P_{jh}\bigwedge Z_{jh}} \right\rbrack}}} \right\}}} \\{\approx \frac{\sum\limits_{h}\left\{ {{{freq}\left( {Y_{ih},C_{kh}} \right)} - {\sum\limits_{j}{{freq}\left( {Z_{jh},C_{kh}} \right)}}} \right\}}{\sum\limits_{h}\left\{ {{{freq}\left( {Y_{ih},P_{jh}} \right)} - {\sum\limits_{j}{{freq}\left( {Z_{jh},P_{jh}} \right)}}} \right\}}}\end{matrix} & {{Eq}.\mspace{14mu} 22}\end{matrix}$

For the above expression to be an accurate approximation of the mean probability Pr_(mean)[C_(k)|Y_(i)root^P_(j)], the denominator on the right hand side must be sufficiently large (eg. greater than some positive constant f_(min)). When this is not the case, a preferred remedial method is to first approximate Pr_(mean)[C_(k)|Y_(i)root^P_(j)] by Pr_(mean)[C_(k)|Y_(i)^P_(j)], where the probability is conditional on Y_(i)^P_(j) rather than Y_(i)root^P_(j). Thus

$\begin{matrix}\begin{matrix}{{\Pr_{mean}\left\lbrack {C_{k}❘{Y_{i}{{root}\bigwedge P_{j}}}} \right\rbrack} \approx {\Pr_{mean}\left\lbrack {C_{k}❘{Y_{i}\bigwedge P_{j}}} \right\rbrack}} \\{= \frac{\sum\limits_{h}{\Pr\left\lbrack {C_{kh}\bigwedge Y_{ih}} \right\rbrack}}{\sum\limits_{h}{\Pr\left\lbrack {Y_{ih}\bigwedge P_{jh}} \right\rbrack}}} \\{\approx \frac{\sum\limits_{h}{{freq}\left( {Y_{ih},C_{kh}} \right)}}{\sum\limits_{h}{{freq}\left( {Y_{ih},P_{jh}} \right)}}}\end{matrix} & {{Eq}.\mspace{14mu} 23}\end{matrix}$

If the denominator on the right hand side of Eq. 23 is still not sufficiently large, then Pr_(mean)[C_(k)|Y_(i)^P_(j)] is further approximated by a probability conditioned on W rather than Y_(i), where W is the immediate child of Y_(i) and an ancestor of C_(k). That is

$\quad\begin{matrix}\begin{matrix}{{\Pr_{mean}\left\lbrack {C_{k}❘{Y_{i}{{root}\bigwedge P_{j}}}} \right\rbrack} \approx {\Pr_{mean}\left\lbrack {C_{k}❘{W\bigwedge P_{j}}} \right\rbrack}} \\{= \frac{\sum\limits_{h}{\Pr\left\lbrack {C_{kh}\bigwedge W_{h}} \right\rbrack}}{\sum\limits_{h}{\Pr\left\lbrack {W_{h}\bigwedge P_{jh}} \right\rbrack}}} \\{\approx \frac{\sum\limits_{h}{{freq}\left( {W_{h},C_{kh}} \right)}}{\sum\limits_{h}{{freq}\left( {W_{h},P_{jh}} \right)}}}\end{matrix} & {{Eq}.\mspace{14mu} 24}\end{matrix}$

The method is repeated further until a sufficiently large value is obtained for the denominator on the right hand side, or if not, until W denotes a parent of C_(k). If the latter, then Pr_(mean)[C_(k)|Y_(i)root^P_(j)] is assigned a value based on the distance between C_(k) and Y_(i):

$\Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j] \approx \begin{cases} 1 & \text{if } \mathrm{dist}(C_k, Y_i) \leq d_{\max} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq. 25)}$

or the distance between C_(k) and the hit node X:

$\begin{matrix}{{\Pr_{mean}\left\lbrack {C_{k}❘{Y_{i}{root}}} \right\rbrack} \approx \left\{ \begin{matrix}{{1\mspace{14mu}{if}\mspace{14mu}{{dist}\left( {C_{k},X} \right)}} \leq d_{\max}} \\{0\mspace{14mu}{otherwise}}\end{matrix} \right.} & {{Eq}.\mspace{14mu} 26}\end{matrix}$
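The cascade from Eq. 22 through Eq. 25 amounts to trying successively coarser frequency ratios until one has a large enough denominator. A minimal Python sketch follows; the function name, the labelled `candidates` list of pre-computed (numerator, denominator) sums, and the choice of Eq. 25 rather than Eq. 26 as the final fallback are assumptions for illustration.

```python
from typing import List, Tuple


def context_probability(candidates: List[Tuple[str, float, float]],
                        f_min: float, dist_c_y: int, d_max: int) -> Tuple[str, float]:
    """Return (equation used, probability) for a child node C_k.

    candidates lists (label, numerator, denominator) in order of preference,
    e.g. the pooled frequency sums of Eq. 22, Eq. 23, then Eq. 24 for each W.
    """
    for label, numerator, denominator in candidates:
        if denominator > f_min:  # denominator is "sufficiently large"
            return label, numerator / denominator
    # No ratio is reliable: fall back to the distance rule of Eq. 25
    return "Eq. 25", (1.0 if dist_c_y <= d_max else 0.0)


candidates = [("Eq. 22", 1, 2), ("Eq. 23", 6, 12), ("Eq. 24", 30, 40)]
chosen = context_probability(candidates, f_min=5, dist_c_y=2, d_max=3)
fallback = context_probability(candidates, f_min=100, dist_c_y=2, d_max=3)
```

Returning the label alongside the value lets the caller select the matching formula for the fictitious child node, since the choice of equation for Q₀ depends on which approximation was used for the actual child nodes.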

Depending on whether Pr_(mean)[C_(k)|Y_(i)root^P_(j)] is eventually approximated by Eq. 22, Eq. 23, Eq. 24, Eq. 25 or Eq. 26, the mean probability that a parent node P_(j) has no context child nodes given the root node Y_(i) is computed using Eq. 27, Eq. 28, Eq. 29, Eq. 30 or Eq. 31 respectively:

$\begin{aligned} \Pr_{mean}[P_j\ \text{leaf} \mid Y_i\ \text{root} \wedge P_j] &= \frac{\sum_h \Pr[P_{jh}\ \text{leaf} \wedge Y_{ih}\ \text{root}]}{\sum_h \Pr[Y_{ih}\ \text{root} \wedge P_{jh}]} \\ &\approx \frac{\sum_h \left\{ \mathrm{freq}(Y_{ih}, P_{jh}\ \text{leaf}) - \sum_j \mathrm{freq}(Z_{jh}, P_{jh}\ \text{leaf}) \right\}}{\sum_h \left\{ \mathrm{freq}(Y_{ih}, P_{jh}) - \sum_j \mathrm{freq}(Z_{jh}, P_{jh}) \right\}} \end{aligned} \qquad \text{(Eq. 27)}$

$\begin{aligned} \Pr_{mean}[P_j\ \text{leaf} \mid Y_i\ \text{root} \wedge P_j] &\approx \Pr_{mean}[P_j\ \text{leaf} \mid Y_i \wedge P_j] \\ &= \frac{\sum_h \Pr[P_{jh}\ \text{leaf} \wedge Y_{ih}]}{\sum_h \Pr[Y_{ih} \wedge P_{jh}]} \\ &\approx \frac{\sum_h \mathrm{freq}(Y_{ih}, P_{jh}\ \text{leaf})}{\sum_h \mathrm{freq}(Y_{ih}, P_{jh})} \end{aligned} \qquad \text{(Eq. 28)}$

$\begin{aligned} \Pr_{mean}[P_j\ \text{leaf} \mid Y_i\ \text{root} \wedge P_j] &\approx \Pr_{mean}[P_j\ \text{leaf} \mid W \wedge P_j] \\ &= \frac{\sum_h \Pr[P_{jh}\ \text{leaf} \wedge W_h]}{\sum_h \Pr[W_h \wedge P_{jh}]} \\ &\approx \frac{\sum_h \mathrm{freq}(W_h, P_{jh}\ \text{leaf})}{\sum_h \mathrm{freq}(W_h, P_{jh})} \end{aligned} \qquad \text{(Eq. 29)}$

$\Pr_{mean}[P_j\ \text{leaf} \mid Y_i\ \text{root} \wedge P_j] \approx \begin{cases} 0 & \text{if } \mathrm{dist}(P_j, Y_i) + 1 \leq d_{\max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 30)}$

$\Pr_{mean}[P_j\ \text{leaf} \mid Y_i\ \text{root} \wedge P_j] \approx \begin{cases} 0 & \text{if } \mathrm{dist}(P_j, X) + 1 \leq d_{\max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 31)}$

A preferred procedure for determining context child nodes for a parent node P_(j) given a root element Y_(i) for the general case where P_(j) does not lie along the directed path from Y_(i) to the hit node X with probability averaging is similar to that shown in FIG. 6, and is shown in FIG. 13. The method 13000 begins at step 13001 where a fictitious child node C₀ is conceptually created and added to the list of actual child nodes C₁, . . . , C_(m) and is assigned a probability value Q₀=Pr_(mean)[P_(j)leaf|Y_(i)root^P_(j)] computed using Eq. 27, Eq. 28, Eq. 29, Eq. 30 or Eq. 31 and at the next step 13005, the actual child nodes C_(k) are correspondingly assigned probability values Q_(k)=Pr_(mean)[C_(k)|Y_(i)root^P_(j)] computed using Eq. 22, Eq. 23, Eq. 24, Eq. 25 or Eq. 26 respectively. In any case, step 13006 follows step 13005 and invokes the method 5000 at step 5010 (skipping step 5005) to select among the child nodes C₀, . . . , C_(m) a set of context nodes. When method 5000 exits, the method 13000 resumes at decision step 13010 where a check is made to determine if the fictitious child node C₀ has been selected as a context node. If so then execution continues at step 13020 where C₀ is excluded as a context node. The method 13000 subsequently terminates at step 13015. If the test at 13010 fails, then the method proceeds directly to the termination step 13015.

Apart from their use in keyword searching, the methods 13000 and 6000 can also be used as means of selective presentation of hierarchical data. As already discussed, a practical hierarchical data source typically contains much more data than a user may wish to see at any given time. When a user views a hierarchical data source by selecting a node within its data structure, a presentation application typically displays all data items in the sub-tree below the selected node, some of which may often not be of interest to the user. It would be highly desirable if the presentation application were able to filter out uninteresting data based on some previously observed viewing patterns of the user. The methods 13000 and 6000 as described are well suited for this task. By setting Y_(i) to the root node selected for viewing by the user, the set of context nodes identified by the methods constitutes nodes that are likely to be of interest and are preferably displayed to the user, whilst the remaining nodes not identified as context nodes are preferably filtered out.

For the special case where a parent node P_(j) lies on the directed path from the root node Y_(i) to the hit node X, recall that the selection of child nodes C_(k) of Y_(i) for inclusion as context nodes is based on the probabilities

$\begin{matrix}{{\Pr\left\lbrack C_{k} \middle| {X\bigwedge P_{j}} \right\rbrack} = \frac{\Pr\left\lbrack {C_{k}\bigwedge X\bigwedge P_{j}} \right\rbrack}{\Pr\left\lbrack {X\bigwedge P_{j}} \right\rbrack}} & {{Eq}.\mspace{14mu} 32}\end{matrix}$

With probability averaging these are replaced by a mean probability:

$\Pr_{mean}[C_k \mid X \wedge P_j] = \frac{\sum_h \Pr[C_{kh} \wedge X_h \wedge P_{jh}]}{\sum_h \Pr[X_h \wedge P_{jh}]} \approx \frac{\sum_h \mathrm{freq}(C_{kh}, P_{jh}, X_h)}{\sum_h \mathrm{freq}(P_{jh}, X_h)} \qquad \text{(Eq. 33)}$

where (P_(jh), C_(kh)) is equivalent to (P_(j), C_(k)) and (P_(jh), X_(h)) is equivalent to (P_(j), X), with P_(j0), C_(k0) and X₀ (ie. h=0) being aliases for P_(j), C_(k) and X respectively.

For the above expression to be an accurate approximation of the mean probability Pr_(mean)[C_(k)|X^P_(j)], the denominator on the right hand side of Eq. 33 must be sufficiently large (eg. greater than f_(min)). When this is not the case, another remedial method that may be used is to approximate Pr_(mean)[C_(k)|X^P_(j)] by Pr_(mean)[C_(k)|X′^P_(j)], a probability conditioned on X′ rather than X, where X′ is the immediate parent of X lying on the directed path from Y_(i) to X.

A flowchart of a method 22000 for identifying a node X′ used for determining an approximation for Pr_(mean)[C_(k)|X^P_(j)] is shown in FIG. 22. The method 22000 begins at step 22005 where X′ is first initialised to X. At the next step 22010 the sum

$\sum_h \mathrm{freq}(P_{jh}, X'_h)$

is computed and assigned to D, where the node pairs (P_(jh), X′_(h)) are equivalent to (P_(j), X′). Decision step 22015 then follows and tests whether D is greater than or equal to some positive threshold constant f_(min). If so, the method 22000 exits with success at step 22025. If the decision step 22015 fails then execution proceeds to another decision step 22030, where a test is made to determine whether X′ is an immediate child of P_(j). If so then the method exits with failure at step 22035, otherwise it continues at step 22040 where X′ is replaced by its parent lying along the directed path from P_(j) to X. The method 22000 then loops back to step 22010.
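The loop of method 22000 can be sketched in Python as follows. The function name, the list representation of the directed path, and the dictionary `pooled_freq` (assumed to hold the already-summed value of freq(P_(jh), X′_(h)) for each candidate X′) are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def find_x_prime(path: List[str], pooled_freq: Dict[str, int],
                 f_min: int) -> Optional[Tuple[str, int]]:
    """Walk from X up towards P_j until the pooled frequency D reaches f_min.

    path is [P_j, child of P_j, ..., X]. Returns (X', D) on success, or None
    when even the immediate child of P_j fails the threshold test.
    """
    if len(path) < 2:
        raise ValueError("path must contain P_j and at least one descendant")
    # reversed(path[1:]) visits X first and the immediate child of P_j last
    for x_prime in reversed(path[1:]):
        d = pooled_freq.get(x_prime, 0)
        if d >= f_min:
            return x_prime, d
    return None


path = ["P", "C1", "M", "X"]
pooled_freq = {"X": 1, "M": 3, "C1": 9}
success = find_x_prime(path, pooled_freq, f_min=3)
failure = find_x_prime(path, pooled_freq, f_min=10)
```

A `None` result corresponds to the failure exit at step 22035, after which a distance-based value would be used instead.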

If method 22000 succeeds with a node X′ and a corresponding value D,then Pr_(mean)[C_(k)|X^P_(j)] is assigned the value

$\begin{matrix}{{\Pr_{mean}\left\lbrack C_{k} \middle| {X\bigwedge P_{j}} \right\rbrack} \approx \frac{\sum\limits_{h}^{\;}{{freq}\left( {C_{kh},P_{jh},X_{h}^{\prime}} \right)}}{D}} & {{Eq}.\mspace{14mu} 34}\end{matrix}$

In the event that method 22000 exits with failure, Pr_(mean)[C_(k)|X^P_(j)] is assigned a value based on the distance between C_(k) and Y_(i):

$\Pr_{mean}[C_k \mid X \wedge P_j] \approx \begin{cases} 1 & \text{if } \mathrm{dist}(C_k, Y_i) \leq d_{\max} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq. 35)}$

or the distance between C_(k) and the hit node X:

$\begin{matrix}{{\Pr_{mean}\left\lbrack C_{k} \middle| {X\bigwedge P_{j}} \right\rbrack} \approx \left\{ \begin{matrix}1 & {{{if}\mspace{14mu}{dist}\left( {C_{k},X} \right)} \leq d_{\max}} \\0 & {otherwise}\end{matrix} \right.} & {{Eq}.\mspace{14mu} 36}\end{matrix}$

Depending on whether Pr_(mean)[C_(k)|X^P_(j)] is eventually approximated by Eq. 34, Eq. 35 or Eq. 36, the mean probability that a parent node P_(j) has no context child nodes other than the child node C₁ lying on the directed path from P_(j) to X, given P_(j) and the hit node X, is computed using Eq. 37, Eq. 38, or Eq. 39 respectively:

$\Pr_{mean}[C_1\ \text{no sibling} \mid P_j \wedge X] \approx \frac{\sum_h \mathrm{freq}(P_{jh}\ \text{has 1 child}, X'_h)}{D} \qquad \text{(Eq. 37)}$

$\Pr_{mean}[C_1\ \text{no sibling} \mid P_j \wedge X] \approx \begin{cases} 0 & \text{if } \mathrm{dist}(P_j, Y_i) + 1 \leq d_{\max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 38)}$

$\Pr_{mean}[C_1\ \text{no sibling} \mid P_j \wedge X] \approx \begin{cases} 0 & \text{if } \mathrm{dist}(P_j, X) + 1 \leq d_{\max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 39)}$

where (P_(jh), X′_(h)) are equivalent to (P_(j), X′), and X′ and D are obtained by method 22000.

A preferred procedure for determining context child nodes for a parent node P_(j) for the special case where P_(j) lies along the directed path from Y_(i) to the hit node X with probability averaging is very similar to that shown in FIG. 8, and is shown in FIG. 14.

A method 14000 shown in FIG. 14 begins at step 14001 where a fictitious child node C₀ is conceptually created and added to the list of actual child nodes C₁, . . . , C_(m) and is assigned a probability value Q₀=Pr_(mean)[C₁ no sibling|P_(j)^X] computed using Eq. 37, Eq. 38 or Eq. 39, and at the next step 14005, the actual child nodes C_(k) except C₁ are correspondingly assigned probability values Q_(k)=Pr_(mean)[C_(k)|X^P_(j)] computed using Eq. 34, Eq. 35 or Eq. 36 respectively. In any case, step 14006 follows step 14005 and invokes method 7000 at step 7005 (skipping step 7001) to select among the child nodes C₀, . . . , C_(m) a set of context nodes. When the method 7000 exits, the method 14000 resumes at decision step 14010 where a check is made to determine if the fictitious child node C₀ has been selected as a context node. If so then execution continues at step 14020 where C₀ is excluded as a context node. The method 14000 subsequently terminates at step 14015. If the test at 14010 fails, then the method proceeds directly to the termination step 14015.

The preceding discussion describes methods for identifying context nodes in the special case where there is at most a single hit node in the schema graph. This is a usual scenario when the user enters only a single search keyword. In the event that the keyword appears in multiple locations in the schema graph, signifying that there is more than one hit, then each hit is preferably treated separately. That is, the methods as described are applied for a first hit node in the schema graph and a plurality of context trees are determined for the hit node. The same methods are then subsequently applied for each of the remaining hit nodes to obtain a new plurality of context trees, and so on. When all hit nodes have been processed, the generated context trees may be re-scored if they are found to encompass multiple hit nodes, and in addition duplicated context trees are removed. The list of the remaining context trees is then reordered according to their new scores (if any) and returned to the user as the result of the keyword search operation.

If the user however initiates a ‘find all’ keyword search operation involving multiple search keywords combined with a Boolean AND operation, then keyword hits can potentially appear in two or more hit nodes in the schema graph. A more general method for determining context trees is now described for handling such a scenario.

FIG. 11 shows an example of a schema graph 11000 within which there are multiple hit nodes 11010, 11020 and 11025. Let these hit nodes be denoted by X₁, . . . , X_(n). Naturally, for a context tree to include all hit nodes, the root node of the smallest sub-tree containing all hit nodes, denoted by A (node 11005), must be returned as a context node, as well as all nodes lying along the directed path from A to each hit node. Thus node 11015 must be a context node since it lies along the directed path from A to X₂ (and from A to X₃).
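Finding the node A and the compulsory context nodes can be illustrated with a short Python sketch. It assumes, for simplicity, that every node has a single parent (a tree), whereas the schema graph may allow several; the function name and the `parent` map are hypothetical, and the shape of the example tree is only a guess at FIG. 11 based on the description above.

```python
from typing import Dict, List, Set, Tuple


def smallest_common_root(parent: Dict[str, str], hits: List[str]) -> Tuple[str, Set[str]]:
    """Return (A, nodes on the directed paths from A to every hit node)."""
    def chain(node: str) -> List[str]:
        # The node followed by its ancestors, nearest first
        out = [node]
        while node in parent:
            node = parent[node]
            out.append(node)
        return out

    chains = [chain(h) for h in hits]
    common = set(chains[0]).intersection(*(set(c) for c in chains[1:]))
    # The lowest common ancestor is the first shared node met when climbing
    a = next(n for n in chains[0] if n in common)
    forced: Set[str] = set()
    for c in chains:
        forced.update(c[:c.index(a) + 1])
    return a, forced


# Assumed layout: 11005 has children 11010 and 11015; 11015 has children 11020, 11025
parent = {"11005": "root", "11010": "11005", "11015": "11005",
          "11020": "11015", "11025": "11015"}
a, forced = smallest_common_root(parent, ["11010", "11020", "11025"])
```

The returned set contains A, the hit nodes themselves, and the intermediate node 11015, but not the ancestors of A, whose inclusion is decided by the bottom-up phase.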

The first, bottom-up phase of the context tree determination method begins at node A and traverses upwards. Let Y_(i) be A or an ancestor of A, whose probability given the hit nodes, Pr[Y_(i)|X₁^ . . . ^X_(n)], needs to be evaluated in order to determine the possible root node of a context tree. Expressed mathematically:

$\begin{matrix}{{\Pr\left\lbrack {Y_{i}❘{X_{1}\bigwedge\ldots\bigwedge X_{n}}} \right\rbrack} = \frac{\Pr\left\lbrack {Y_{i}\bigwedge X_{1}\bigwedge\ldots\bigwedge X_{n}} \right\rbrack}{\Pr\left\lbrack {X_{1}\bigwedge\ldots\bigwedge X_{n}} \right\rbrack}} & {{Eq}.\mspace{14mu} 40}\end{matrix}$

At this point some probabilistic independence assumptions are necessary since neither the numerator nor the denominator on the right hand side of Eq. 40 can be obtained or estimated directly from the existing frequency tables for a general value of n (except for the denominator when n≦2). A plausible assumption is that the set of X_(l) are independent of one another given a common ancestor Y_(i). In other words:

$\Pr[X_1 \wedge \ldots \wedge X_n \mid Y_i] = \Pr[X_1 \mid Y_i] \cdots \Pr[X_n \mid Y_i] \qquad \text{(Eq. 41)}$

Thus

$\begin{aligned} \Pr[Y_i \wedge X_1 \wedge \ldots \wedge X_n] &= \Pr[X_1 \wedge \ldots \wedge X_n \mid Y_i]\,\Pr[Y_i] \\ &= \Pr[X_1 \mid Y_i] \cdots \Pr[X_n \mid Y_i]\,\Pr[Y_i] \\ &= \frac{\Pr[X_1 \wedge Y_i] \cdots \Pr[X_n \wedge Y_i]}{\Pr[Y_i]^{n-1}} \end{aligned} \qquad \text{(Eq. 42)}$

In order to remove the singularity when Pr[Y_(i)]=0, Pr[Y_(i)^X₁^ . . . ^X_(n)] is redefined as

$\Pr[Y_i \wedge X_1 \wedge \ldots \wedge X_n] = \begin{cases} 0 & \text{if } \Pr[Y_i] = 0 \\ \dfrac{\Pr[X_1 \wedge Y_i] \cdots \Pr[X_n \wedge Y_i]}{\Pr[Y_i]^{n-1}} & \text{otherwise} \end{cases} \qquad \text{(Eq. 43)}$

Similarly

$\begin{aligned} \Pr[X_1 \wedge \ldots \wedge X_n] &= \Pr[A \wedge X_1 \wedge \ldots \wedge X_n] \\ &= \Pr[X_1 \wedge \ldots \wedge X_n \mid A]\,\Pr[A] \\ &= \Pr[X_1 \mid A] \cdots \Pr[X_n \mid A]\,\Pr[A] \\ &= \frac{\Pr[X_1 \wedge A] \cdots \Pr[X_n \wedge A]}{\Pr[A]^{n-1}} \end{aligned} \qquad \text{(Eq. 44)}$

As in the case where there is only a single hit node, the occurrence probability of Y_(i) given all hit nodes is preferably expressed incrementally in terms of the probability of its immediate child node to facilitate probability averaging, as follows: Let W denote the immediate child node of Y_(i) along the directed path from Y_(i) to A, then

$\Pr'[Y_i \mid X_1 \wedge \ldots \wedge X_n] = \begin{cases} 1 & \text{if } Y_i = A \\ 0 & \text{if } Y_i \neq A \text{ and } \Pr'[W \mid X_1 \wedge \ldots \wedge X_n] = 0 \\ \Pr_{mean}[Y_i \mid W \wedge X_1 \wedge \ldots \wedge X_n] \cdot \Pr'[W \mid X_1 \wedge \ldots \wedge X_n] & \text{otherwise} \end{cases} \qquad \text{(Eq. 45)}$

where

$\Pr_{mean}[Y_i \mid W \wedge X_1 \wedge \ldots \wedge X_n] = \frac{\sum_h \Pr[Y_{ih} \wedge W_h \wedge X_{1h} \wedge \ldots \wedge X_{nh}]}{\sum_h \Pr[W_h \wedge X_{1h} \wedge \ldots \wedge X_{nh}]} \qquad \text{(Eq. 46)}$

where the pairs (Y_(ih), W_(h)) are equivalent to (Y_(i), W), and (W_(h), X_(lh)) are equivalent to (W, X_(l)) for l=1, . . . , n, and Y_(i0), W₀ and X_(l0) (h=0) are aliases for Y_(i), W, and X_(l) respectively. The term inside the summation on the numerator can be substituted by Eq. 43. The term inside the summation of the denominator can also be substituted by Eq. 43 by letting W play the role of Y_(i), thus resulting in

$\Pr_{mean}\left\lbrack Y_{i} \mid W\bigwedge X_{1}\bigwedge\ldots\bigwedge X_{n} \right\rbrack = \dfrac{\sum\limits_{h} N_{h}}{\sum\limits_{h} D_{h}} \qquad \text{Eq. 47}$

where

$\begin{aligned} N_{h} &= \begin{cases} 0 & \text{if } \Pr\left\lbrack Y_{ih} \right\rbrack = 0 \\ \dfrac{\Pr\left\lbrack Y_{ih}\bigwedge X_{1h} \right\rbrack \ldots \Pr\left\lbrack Y_{ih}\bigwedge X_{nh} \right\rbrack}{\Pr\left\lbrack Y_{ih} \right\rbrack^{n - 1}} & \text{otherwise} \end{cases} \\ &\approx \begin{cases} 0 & \text{if } \mathrm{freq}\left( Y_{ih} \right) = 0 \\ \dfrac{\mathrm{freq}\left( Y_{ih},X_{1h} \right) \ldots \mathrm{freq}\left( Y_{ih},X_{nh} \right)}{\mathrm{freq}\left( Y_{ih} \right)^{n - 1}} & \text{otherwise} \end{cases} \end{aligned} \qquad \text{Eq. 48}$

$\begin{aligned} D_{h} &= \begin{cases} 0 & \text{if } \Pr\left\lbrack W_{h} \right\rbrack = 0 \\ \dfrac{\Pr\left\lbrack W_{h}\bigwedge X_{1h} \right\rbrack \ldots \Pr\left\lbrack W_{h}\bigwedge X_{nh} \right\rbrack}{\Pr\left\lbrack W_{h} \right\rbrack^{n - 1}} & \text{otherwise} \end{cases} \\ &\approx \begin{cases} 0 & \text{if } \mathrm{freq}\left( W_{h} \right) = 0 \\ \dfrac{\mathrm{freq}\left( W_{h},X_{1h} \right) \ldots \mathrm{freq}\left( W_{h},X_{nh} \right)}{\mathrm{freq}\left( W_{h} \right)^{n - 1}} & \text{otherwise} \end{cases} \end{aligned} \qquad \text{Eq. 49}$

Pr_(mean)[Y_(i)|W^X₁^ . . . ^X_(n)] is undefined if $\sum\limits_{h} D_{h} = 0$. When this occurs Pr_(mean)[Y_(i)|W^X₁^ . . . ^X_(n)] is preferably assigned a value based on the distances from Y_(i) to the hit nodes X₁, . . . , X_(n) as follows

$\begin{matrix}{{\Pr_{mean}\left\lbrack Y_{i} \middle| {W\bigwedge X_{1}\bigwedge\ldots\bigwedge X_{n}} \right\rbrack} = \left\{ \begin{matrix}1 & {{{if}\mspace{14mu}{\min\limits_{{l = 1},\;\ldots\mspace{11mu},n}\;{{dist}\left( {Y_{i},X_{l}} \right)}}} \leq d_{\max}} \\0 & {otherwise}\end{matrix} \right.} & {{Eq}.\mspace{14mu} 50}\end{matrix}$
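By way of illustration only, the frequency-based estimate of Eq. 47 to Eq. 49, together with the distance-based fallback of Eq. 50, may be sketched in Python as follows. The function names, the argument layout (one tuple of frequencies per equivalent index h) and the idea of passing the minimum distance in as a number are assumptions made for this sketch and do not form part of the described implementation.

```python
from math import prod
from typing import List, Sequence, Tuple

# One entry per equivalent index h: (freq(node_h), [freq(node_h, X_1h), ..., freq(node_h, X_nh)])
Term = Tuple[int, Sequence[int]]


def joint_term(freq_node: int, pair_freqs: Sequence[int]) -> float:
    """Frequency estimate of N_h (Eq. 48) or D_h (Eq. 49)."""
    if freq_node == 0:
        return 0.0
    n = len(pair_freqs)
    return prod(pair_freqs) / freq_node ** (n - 1)


def pr_mean(numer: List[Term], denom: List[Term], min_dist: int, d_max: int) -> float:
    """Eq. 47, falling back to Eq. 50 when the denominator sum is zero.

    numer    -- terms built from Y_ih (the N_h)
    denom    -- terms built from W_h (the D_h)
    min_dist -- minimum of dist(Y_i, X_l) over all hit nodes (assumed precomputed)
    """
    n_sum = sum(joint_term(f, p) for f, p in numer)
    d_sum = sum(joint_term(f, p) for f, p in denom)
    if d_sum == 0:
        return 1.0 if min_dist <= d_max else 0.0
    return n_sum / d_sum
```

With purely hypothetical frequencies, `pr_mean([(2, [1, 2])], [(4, [2, 4])], 1, 3)` evaluates N₀ = 1·2/2 = 1 and D₀ = 2·4/4 = 2, giving 0.5.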

A flowchart of a method 12000 for computing the probability that a node Y_(i) is the root node of a context tree containing all hit nodes, for all choices of Y_(i), is shown in FIG. 12. The method 12000 begins at step 12001 where the root node of the smallest subtree in the schema graph that contains all hit nodes is identified and denoted as A. Execution then continues at step 12002 where Y_(i) is initialised to A and consequently Pr′[Y_(i)|X₁^ . . . ^X_(n)]=1. At the next step 12005 Eq. 45, Eq. 47 together with Eq. 48 and Eq. 49, or alternatively Eq. 50, are used to compute Pr′[Z_(j)|X₁^ . . . ^X_(n)] for each parent node Z_(j) of Y_(i). Following step 12005, step 12010 computes Pr′[Y_(i) root|X₁^ . . . ^X_(n)] according to the equation

$\begin{matrix}{{\Pr^{\prime}\left\lbrack {Y_{i}\mspace{14mu}{root}} \middle| {X_{1}\bigwedge\ldots\bigwedge X_{n}} \right\rbrack} = {{\Pr^{\prime}\left\lbrack Y_{i} \middle| {X_{1}\bigwedge\ldots\bigwedge X_{n}} \right\rbrack} - {\sum\limits_{j}{\Pr^{\prime}\left\lbrack Z_{j} \middle| {X_{1}\bigwedge\ldots\bigwedge X_{n}} \right\rbrack}}}} & {{Eq}.\mspace{14mu} 51}\end{matrix}$

The method 12000 then proceeds to step 12015 where a parent node Z_(j) of Y_(i) is selected. Upon reaching step 12020, the method 12000 is recursively invoked at step 12005 (skipping steps 12001 and 12002) but with the selected parent node Z_(j) playing the role of Y_(i). When the recursive invocation returns, execution resumes at decision step 12025 where a test is made to determine whether all parent nodes of Y_(i) have been processed. If so, the method ends at step 12030, otherwise it continues at step 12015 where another parent node Z_(j) of Y_(i) is selected for processing.
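The recursion of steps 12002 to 12030 may be pictured with the following Python sketch, which is offered as an illustration under stated assumptions rather than as the implementation itself. The node A is taken as already identified (step 12001), the schema graph is supplied as a hypothetical `parents` mapping, each ancestor is assumed to be reached along a single path from A, and `pr_mean` is a caller-supplied stand-in for Eq. 47 or Eq. 50.

```python
from typing import Callable, Dict, List


def root_probabilities(
    a: str,
    parents: Dict[str, List[str]],
    pr_mean: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return Pr'[Y_i root | X_1 ^ ... ^ X_n] for A and each of its ancestors.

    parents -- maps a node to its parent nodes Z_j (hypothetical representation)
    pr_mean -- pr_mean(z, w) stands in for Pr_mean[Z | W ^ X_1 ^ ... ^ X_n]
    """
    root_prob: Dict[str, float] = {}

    def visit(y: str, pr_y: float) -> None:
        # Step 12005: Eq. 45 applied to every parent Z_j of Y_i.
        pr_parent = {
            z: (0.0 if pr_y == 0 else pr_mean(z, y) * pr_y)
            for z in parents.get(y, [])
        }
        # Step 12010: Eq. 51.
        root_prob[y] = pr_y - sum(pr_parent.values())
        # Steps 12015 to 12025: recurse with each parent playing the role of Y_i.
        for z, pr_z in pr_parent.items():
            visit(z, pr_z)

    visit(a, 1.0)  # step 12002: Pr'[A | X_1 ^ ... ^ X_n] = 1
    return root_prob
```

For a hypothetical chain a → b → c in which every conditional probability is 0.5, the sketch yields 0.5, 0.25 and 0.25 for a, b and c respectively, which sum to 1.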

In the second top-down traversal phase, for the general case where a parent node P_(j) does not lie along the directed path from the root node Y_(i) to any hit node, the method for determining whether a child node of P_(j) is a context node remains unchanged from the method 13000 used previously for the case where there is only a single hit node.

For the special case where the parent node P_(j) lies along the directed path from Y_(i) to one or more hit nodes, it is necessary to modify the method used for the single hit node case to allow for the possibility that more than one child node of P_(j) must be included as context nodes. Recall that for the case involving only one hit node X, the determination of whether a child node C_(k) is a context node is based on the probability value

Pr[C_(k)|X^P_(j)]

where X is a descendant of P_(j) but is not C_(k) or a descendant of C_(k). An extension when there is more than one hit node is to base the selection process of child nodes C_(k), k=1, . . . , m, on the probabilities of C_(k) given the parent node P_(j) and all hit nodes X_(l) that are descendants of P_(j), whilst ignoring the effects of hit nodes that are not descendants of P_(j). Naturally, all child nodes C_(k) that are themselves hit nodes or are ancestors of one or more hit nodes must be context nodes. Without loss of generality, let these child nodes be C₁, . . . , C_(r), where 1≦r≦m. Similarly let the set of hit nodes that are descendants of C₁, . . . , C_(r) be X₁, . . . , X_(s), where r≦s≦n. If s=1 (and hence r=1) then this scenario is equivalent to the case where there is only a single hit node, and hence the method 14000 described for this case can be used. A method adopted in a preferred implementation for generalising for the case s>1 is to replace the term Pr[C_(k)|X^P_(j)] with the expression

$\sum\limits_{l = 1}^{s} \Pr\left\lbrack C_{k} \mid X_{l}\bigwedge P_{j} \right\rbrack$

which becomes

$Q_{k} = \sum\limits_{l = 1}^{s} \Pr_{mean}\left\lbrack C_{k} \mid X_{l}\bigwedge P_{j} \right\rbrack \qquad \text{Eq. 52}$

after probability averaging, where Pr_(mean)[C_(k)|X_(l)^P_(j)] is as defined in Eq. 33. Q_(k) is undefined if freq_(mean)(P_(j),X_(l))=0 for any X_(l), l=1, . . . , s, where

$\mathrm{freq}_{mean}\left( P_{j},X_{l} \right) = \sum\limits_{h} \mathrm{freq}\left( P_{jh},X_{lh} \right)$

and (P_(jh), X_(lh)) are equivalent to (P_(j), X_(l)). Even when freq_(mean)(P_(j), X_(l)) is non-zero but a small number (eg. <f_(min)), Q_(k) cannot be estimated from the frequency tables with sufficient accuracy. As in the case involving a single hit node, this problem can be overcome by replacing the hit nodes X₁, . . . , X_(s) by a new set of nodes S in which some or all of the hit nodes are replaced by their ancestors, after which Q_(k) is redefined in terms of elements of S.

A method 21000 depicted by the flowchart of FIG. 21 is preferably used to determine this new set of nodes S that replaces the hit nodes X₁, . . . , X_(s). The method 21000 begins at step 21005 where the initial set of hit nodes X₁, . . . , X_(s) is denoted by S. At the next step 21010 an unprocessed element X_(p) of S is selected. Decision step 21015 then follows in which a check is made to determine if freq_(mean)(P_(j),X_(p)) is greater than or equal to some threshold constant f_(min). If so then X_(p) is retained in the set S and the method 21000 continues to decision step 21020 where a check is made to determine if all elements in S have been processed. If one or more unprocessed elements remain then execution returns to step 21010 to select another element of S for processing. If on the other hand all elements have been processed, then the method ends at step 21025 with success.

Returning now to decision step 21015, if the test condition fails then another decision step 21030 follows, which tests if the selected node X_(p) is a child node of P_(j). If it is then the method 21000 ends at step 21040 with failure, otherwise step 21035 follows. At step 21035, the element X_(p) in S is replaced by its parent X′_(p) that lies along the directed path from P_(j) to X_(p), and all descendants of X′_(p) are removed from S. Execution then proceeds to step 21020.
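A minimal Python sketch of the method 21000 is given below for illustration. It assumes that the sub-tree under P_(j) is available as a hypothetical `parent` mapping, that `freq_mean` is supplied by the caller, and that a parent X′_(p) substituted at step 21035 is treated as an unprocessed element to be checked in turn. The description does not state the last point explicitly, so it is an assumption of this sketch.

```python
from typing import Callable, Dict, List, Optional


def is_descendant(node: str, ancestor: str, parent: Dict[str, str]) -> bool:
    """True if ancestor lies strictly above node in the parent mapping."""
    cur = parent.get(node)
    while cur is not None:
        if cur == ancestor:
            return True
        cur = parent.get(cur)
    return False


def replace_sparse_hits(
    hits: List[str],
    p_j: str,
    parent: Dict[str, str],
    freq_mean: Callable[[str, str], float],
    f_min: float,
) -> Optional[List[str]]:
    """Return the set S (as a list) on success, or None on failure."""
    s = list(hits)                                  # step 21005
    processed = set()
    while True:
        pending = [x for x in s if x not in processed]
        if not pending:                             # step 21020 -> step 21025
            return s
        x_p = pending[0]                            # step 21010
        if freq_mean(p_j, x_p) >= f_min:            # step 21015
            processed.add(x_p)
            continue
        if parent[x_p] == p_j:                      # step 21030 -> step 21040
            return None
        x_parent = parent[x_p]                      # step 21035
        # Drop X_p and every other element lying under its parent.
        s = [x for x in s if not is_descendant(x, x_parent, parent)]
        if x_parent not in s:
            s.append(x_parent)
```

In a hypothetical sub-tree where hit nodes x1 and x2 share a parent a, and only x1 falls below the threshold, both x1 and x2 are replaced by a, since x2 is a descendant of a.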

If method 21000 as described above returns with success, then the elements in the resulting set S are used to compute a value Q_(k) for each child node C₁, . . . , C_(r):

$\begin{matrix}{Q_{k} = {\sum\limits_{X \in S}{\Pr_{mean}\left\lbrack C_{k} \middle| {X\bigwedge P_{j}} \right\rbrack}}} & {{Eq}.\mspace{14mu} 53}\end{matrix}$

If however method 21000 returns with failure, then the value Q_(k) is preferably determined from the distance between C_(k) and the root node Y_(i):

$Q_{k} = \begin{cases} 1 & \mathrm{dist}\left( C_{k},Y_{i} \right) \leq d_{\max} \\ 0 & \text{otherwise} \end{cases} \qquad \text{Eq. 54}$

or alternatively the distances between C_(k) and the hit nodes X₁, . . . , X_(n):

$\begin{matrix}{Q_{k} = \left\{ \begin{matrix}1 & {{\min\limits_{\;{{l\; = \; 1},\;\ldots,\; n}}{{dist}\left( {C_{\; k},X_{\; l}} \right)}} \leq d_{\;\max}} \\0 & {otherwise}\end{matrix} \right.} & {{Eq}.\mspace{14mu} 55}\end{matrix}$

Recall also that in the case where there is a single hit node X, it is necessary to evaluate the probability that given a parent node P_(j) and a hit node X, only one child node C₁ of P_(j) occurs, where C₁ is or is an ancestor of X:

Pr[C₁ no sibling|X^P_(j)]

In generalising this quantity to the present scenario, two possibilities can arise, namely the special case where r=1, and the more general case r>1. An example of the former is shown in FIG. 15 where there are three hit nodes 15030, 15035 and 15040 located within the sub-tree rooted at node 15005. However, all three hit nodes reside under a single child node 15010 of node 15005.

An approach for handling this special case is to replace the term Pr[C₁ no sibling|X^P_(j)] with the expression

$Q_{0} = \sum\limits_{l = 1}^{s} \Pr_{mean}\left\lbrack C_{1}\ \text{no sibling} \mid X_{l}\bigwedge P_{j} \right\rbrack \qquad \text{Eq. 56}$

in an analogous fashion to the use of the quantity Q_(k) in Eq. 52. As in the case of Q_(k) above, there is a possibility that Eq. 56 is undefined when freq_(mean)(P_(j),X_(l))=0 for any X_(l), l=1, . . . , s. Consequently, the actual value assigned to Q₀ is based on the set S obtained from the method 21000, if the method 21000 returns with success:

$\begin{matrix}{Q_{0} = {\sum\limits_{X \in S}^{\;}{\Pr_{mean}\left\lbrack {{C_{1}\mspace{14mu}{no}\mspace{14mu}{sibling}}❘{X\bigwedge P_{j}}} \right\rbrack}}} & {{Eq}.\mspace{14mu} 57}\end{matrix}$

Otherwise if the method 21000 fails, then Q₀ is assigned a value based on the distance of P_(j) from the root node Y_(i):

$Q_{0} = \begin{cases} 0 & \mathrm{dist}\left( P_{j},Y_{i} \right) + 1 \leq d_{\max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{Eq. 58}$

or alternatively the distances between P_(j) and the hit nodes X₁, . . . , X_(n):

$\begin{matrix}{Q_{0} = \left\{ \begin{matrix}0 & {{{\min\limits_{\;{{l\; = \; 1},\;\ldots,\; n}}{{dist}\left( {P_{\; j},X_{\; l}} \right)}} + 1} \leq d_{\;\max}} \\1 & {otherwise}\end{matrix} \right.} & {{Eq}.\mspace{14mu} 59}\end{matrix}$

A method 16000 depicted in the flowchart of FIG. 16 for identifying context nodes among the set of child nodes C_(k) of a parent node P_(j) for the case r=1, s>1 is very similar to the method 14000 for the single hit node case. The method 16000 begins at step 16001 where a fictitious child node C₀ is conceptually created and added to the list of actual child nodes C₁, . . . , C_(m) and is assigned a value Q₀ defined in Eq. 57, Eq. 58, or Eq. 59, and at the next step 16005, the actual child nodes C_(k) except C₁ are assigned values Q_(k) correspondingly defined in Eq. 53, Eq. 54, or Eq. 55 respectively. Step 16006 follows step 16005 and invokes the method 7000 at step 7005 (skipping step 7001) to select among the child nodes C₀, . . . , C_(m) a set of context nodes. When the method 7000 exits, the method 16000 resumes at decision step 16010 where a check is made to determine if the fictitious child node C₀ has been selected as a context node. If so then execution continues at step 16020 where C₀ is excluded as a context node. The method 16000 subsequently terminates at step 16015. If the test at 16010 fails, then the method 16000 proceeds directly to the termination step 16015.

For the general case where r>1 (and hence s>1), an analogous quantity to Pr[C₁ no sibling|X₁^ . . . ^X_(s)^P_(j)] used in the case r=1 is

$\sum\limits_{X \in S} \Pr\left\lbrack C_{1}\bigwedge\ldots\bigwedge C_{r}\bigwedge \neg C_{r + 1}\bigwedge\ldots\bigwedge \neg C_{m} \mid X\bigwedge P_{j} \right\rbrack$

where S is the set returned by the method 21000 if it exits with success. Unfortunately the probability in the summation cannot be easily estimated from the existing frequency tables. Consequently, a slightly different expression is used in its place. Let the elements of set S that are not located in the sub-tree rooted at each child node C_(k), 1≦k≦r be denoted by H_(kl) for 1≦l≦s_(k), where s_(k)≦|S|. For each child node C_(k), 1≦k≦r, the following is computed:

$\begin{matrix}{Q_{k} = {\sum\limits_{l = 1}^{s_{k}}{\Pr_{mean}\left\lbrack {C_{k}❘{H_{kl}\bigwedge P_{j}}} \right\rbrack}}} & {{Eq}.\mspace{14mu} 60}\end{matrix}$

The rationale behind the above expression is that when summed together over all C_(k), 1≦k≦r, a quantity approximating the probability of child nodes C₁, . . . , C_(r) occurring together is obtained (although not a true probability since it can take on a value >1). As in the case r=1, if method 21000 returns with failure then Q_(k) is obtained from the distance of P_(j) from the root node Y_(i), for 1≦k≦r:

$Q_{k} = \begin{cases} 0 & \mathrm{dist}\left( P_{j},Y_{i} \right) + 1 \leq d_{\max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{Eq. 61}$

or alternatively the distances between P_(j) and the hit nodes X₁, . . . , X_(n):

$\begin{matrix}{Q_{k} = \left\{ \begin{matrix}0 & {{{\min\limits_{\;{{l\; = \; 1},\;\ldots,\; n}}{{dist}\left( {P_{\; j},X_{\; l}} \right)}} + 1} \leq d_{\;\max}} \\1 & {otherwise}\end{matrix} \right.} & {{Eq}.\mspace{14mu} 62}\end{matrix}$

A method 17000 for selecting context nodes among the set of child nodes C₁, . . . , C_(m) is now described for the general case r>1, s>1, with reference to the flowchart of FIG. 17. Method 17000 begins at step 17001 where each child node C_(k), 1≦k≦r is assigned a value Q_(k) computed using Eq. 60, Eq. 61, or Eq. 62. Step 17005 follows in which the remaining child nodes are assigned values Q_(k) correspondingly computed using Eq. 53, Eq. 54, or Eq. 55 respectively. At the next step 17010, the values Q_(k) are summed over all child nodes and denoted by T. The method 17000 continues at step 17015 where all child nodes containing hit nodes in their sub-trees, namely C_(k), 1≦k≦r, are selected as context nodes. At the next step 17020, nodes C_(k) with the highest assigned value among the remaining child nodes are also selected as context nodes. If more than one child node exists with the same highest value then all such nodes are selected as context nodes. The sum of the assigned values of all child nodes so far selected as context nodes is then computed at step 17025 and denoted by S. Execution then proceeds to the decision step 17030, at which point if all child nodes C_(k) have been selected as context nodes then the method 17000 terminates at step 17040. If however there are one or more child nodes C_(k) not yet selected as context nodes then the method 17000 continues to another decision step 17035. At step 17035 a check is made to ascertain whether S≧T/2 and if so the method 17000 again terminates at step 17040. If S<T/2 then execution returns to step 17020 where more nodes are selected as context nodes.
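The selection loop of steps 17010 to 17040 may be sketched in Python as follows, purely for illustration. The values Q_(k) are assumed to have been assigned already (steps 17001 and 17005) and are passed in as a hypothetical dictionary keyed by child node, and the function name is likewise an assumption of the sketch.

```python
from typing import Dict, Set


def select_context_children(q: Dict[str, float], hit_children: Set[str]) -> Set[str]:
    """Select context nodes among the children of P_j for the case r > 1.

    q            -- value Q_k already assigned to every child node C_k
    hit_children -- the children C_1..C_r whose sub-trees contain hit nodes
    """
    t = sum(q.values())                          # step 17010: T
    selected = set(hit_children)                 # step 17015
    while True:
        remaining = {c: v for c, v in q.items() if c not in selected}
        if remaining:                            # step 17020: take all nodes tied for the highest value
            best = max(remaining.values())
            selected |= {c for c, v in remaining.items() if v == best}
        s = sum(q[c] for c in selected)          # step 17025: S
        if len(selected) == len(q):              # step 17030
            return selected
        if s >= t / 2:                           # step 17035
            return selected


# Values Q_1..Q_5 taken from the illustrative example later in this description.
q_example = {"node 6": 2, "node 7": 2, "node 8": 0, "node 9": 1, "node 10": 0}
print(sorted(select_context_children(q_example, {"node 8", "node 10"})))
```

With these values T=5, and after nodes 6 and 7 are added the running sum S=4 exceeds T/2, so node 9 is not selected.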

The preceding descriptions present various methods for handling different stages and operating scenarios encountered when performing keyword searching in hierarchical data structures. These methods are incorporated into a single overall procedure 18000 which elaborates on step 2010 of FIG. 2, illustrated by the flowchart in FIG. 18 which comprises sub procedures 19000 and 20000 shown in FIG. 19 and FIG. 20, respectively. The method 18000 begins at decision step 18005 where a check is made to determine whether there are multiple hit nodes in the schema graph. If so then execution proceeds to step 18015 where the method 20000 is invoked, otherwise it proceeds to step 18010 where the method 19000 is invoked. In either case, the method 20000 or 19000 returns with a list of context trees, each having an associated score. The following is a detailed description of the method 19000, followed by that of the method 20000.

The method 19000 begins at step 19001 where the method 10000 is invoked to determine a list of possible root nodes Y_(i) that are ancestor nodes of the hit node X. Each Y_(i) is the root node of a possible context tree. The method 10000 also computes a value S_(i)=Pr′[Y_(i)|X] for each node Y_(i). The method 19000 then continues at step 19005 where a node Y_(i) determined in the previous step is selected for processing. At the next step 19010, method 38000 which is a sub-process within method 19000 is invoked to identify context nodes in the subtree rooted at node Y_(i). Method 19000 then continues at step 19030, where a context tree is constructed comprising all identified context nodes and with Y_(i) as the root node. The tree is assigned a score of S_(i) computed at step 19001. The method 19000 then proceeds to decision step 19035. If all nodes Y_(i) obtained at step 19001 have been processed, then the method ends at step 19040, otherwise it returns to step 19005 to process another node Y_(i).

The method 38000 invoked within method 19000 begins at step 38010 where node Y_(i) is first assigned to P_(j). Execution proceeds to the decision step 38015 and then to step 38020 if P_(j) does not lie on the directed path from Y_(i) to the hit node X. At step 38020, the method 13000 is invoked to select among the child nodes of P_(j) a set of context nodes. At the subsequent step 38025, the method 38000 is recursively invoked at step 38020 (skipping steps 38010 and 38015) for each non-leaf child node C_(k) selected as context node, with C_(k) playing the role of P_(j) in order to identify additional context nodes among its descendants. When the invocations for all such child nodes return, method 38000 terminates at step 38040. Method 38000 also proceeds directly to the termination step 38040 if P_(j) has no child nodes, or if none of its non-leaf child nodes have been selected as context nodes at step 38020.

The decision step 38015 succeeds if P_(j) lies on the directed path from Y_(i) to X, in which case execution proceeds to step 38045. Here the method 14000 is invoked to select among the child nodes of P_(j) a set of context nodes, with C₁ denoting the child node lying on the directed path from P_(j) to X. At the subsequent step 38050, method 38000 is recursively invoked at step 38015 (skipping step 38010) for each non-leaf child node C_(k) selected as context node, with C_(k) playing the role of P_(j) in order to identify additional context nodes among its descendants. When the invocations for all such child nodes return, method 38000 terminates at step 38040.

The method 20000 begins at step 20001 where the method 12000 is invoked to determine a list of possible root nodes Y_(i) that are ancestor nodes of the hit nodes X₁, . . . , X_(n). Each Y_(i) is the root node of a possible context tree. The method 12000 also computes a value S_(i)=Pr′[Y_(i)|X₁^ . . . ^X_(n)] for each node Y_(i). The method 20000 then continues at step 20005 where a node Y_(i) determined in the previous step is selected for processing. At the next step 20010, method 39000 which is a sub-process within method 20000 is invoked to identify context nodes in the subtree rooted at node Y_(i). Method 20000 then continues at step 20060, where a context tree is constructed comprising all identified context nodes and with Y_(i) as the root node. The tree is assigned a score of S_(i) computed at step 20001. The method 20000 then proceeds to decision step 20065. If all nodes Y_(i) obtained at step 20001 have been processed, then the method ends at step 20070, otherwise it returns to step 20005 to process another node Y_(i).

The method 39000 invoked within method 20000 begins at step 39010 where node Y_(i) is first assigned to P_(j). Execution proceeds to the decision step 39015 and then to step 39020 if there are no hit nodes in the sub-tree rooted at P_(j). At step 39020, the method 13000 is invoked to select among the child nodes of P_(j) a set of context nodes. At the subsequent step 39025, the method 39000 is recursively invoked at step 39020 (skipping steps 39010 and 39015) for each non-leaf child node C_(k) selected as context node, with C_(k) playing the role of P_(j) in order to identify additional context nodes among its descendants. When the invocations for all such child nodes return, method 39000 terminates at step 39060. Method 39000 also proceeds directly to the termination step 39060 if P_(j) has no child nodes, or if none of its non-leaf child nodes have been selected as context nodes at step 39020.

The decision step 39015 succeeds if there are one or more hit nodes within the subtree rooted at P_(j), in which case execution proceeds to another decision step 39030. If there is only a single hit node in the sub-tree under P_(j) then this decision step fails and execution proceeds to step 39035, otherwise it continues to yet another decision step 39040. At decision step 39040, a test is made to determine whether all hit nodes under P_(j) are located under only one of its child nodes. If so, then execution proceeds to step 39045, otherwise it proceeds to step 39050. At step 39050, with C₁, . . . , C_(r) denoting the child nodes of P_(j) under which one or more hit nodes reside, the method 17000 is invoked to select among the child nodes of P_(j) a set of context nodes. If however decision step 39040 leads to step 39045, then the method 16000 is invoked to select among the child nodes of P_(j) a set of context nodes, with C₁ being the sole child node of P_(j) that contains hit nodes in its sub-tree.

Returning now to step 39035, let the path from P_(j) to its one and only descendant hit node pass through its child node C₁. The method 14000 is invoked to select among the child nodes of P_(j) a set of context nodes.

At the completion of each of steps 39035, 39045 and 39050, the method 39000 recursively invokes itself at step 39015 (skipping step 39010) for each non-leaf child node C_(k) selected as context node, with C_(k) playing the role of P_(j) in order to identify additional context nodes among its descendants. When the invocations for all such child nodes return, method 39000 terminates at step 39060.
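The dispatch performed by the method 39000 may be summarised by the following Python sketch, offered as an illustration only. It assumes a tree-shaped schema graph held in a hypothetical `children` mapping, counts a child node that is itself a hit node as containing a hit, and takes the methods 13000, 14000, 16000 and 17000 as caller-supplied functions keyed by method number. None of these names or representations is prescribed by the description. For simplicity the sketch re-tests the number of hit nodes on every invocation, which yields the same branch as skipping step 39015 beneath a node that has no hit nodes.

```python
from typing import Callable, Dict, List, Set

Selector = Callable[[int], List[int]]  # given P_j, returns its children chosen as context nodes


def hits_in(node: int, children: Dict[int, List[int]], hits: Set[int]) -> int:
    """Number of hit nodes in the sub-tree rooted at node (node itself included)."""
    return int(node in hits) + sum(hits_in(c, children, hits) for c in children.get(node, []))


def collect_context_nodes(
    p_j: int,
    children: Dict[int, List[int]],
    hits: Set[int],
    selectors: Dict[int, Selector],
) -> Set[int]:
    """Return the context nodes found among the descendants of P_j."""
    kids = children.get(p_j, [])
    if not kids:                                   # no child nodes: step 39060
        return set()
    per_child = [hits_in(c, children, hits) for c in kids]
    total = sum(per_child)
    if total == 0:                                 # step 39015 fails -> step 39020
        method = 13000
    elif total == 1:                               # step 39030 fails -> step 39035
        method = 14000
    elif sum(1 for v in per_child if v > 0) == 1:  # step 39040 -> step 39045
        method = 16000
    else:                                          # step 39040 -> step 39050
        method = 17000
    chosen = selectors[method](p_j)
    found = set(chosen)
    for c_k in chosen:                             # recurse; leaf children return an empty set
        found |= collect_context_nodes(c_k, children, hits, selectors)
    return found
```

Passing the selectors in as a dictionary keeps the traversal independent of how each selection method is realised, which is convenient when exercising the control flow with stub selectors.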

ILLUSTRATIVE EXAMPLE

The operation of a preferred implementation is now demonstrated with an example hierarchical XML data source below. The XML source comprises data relating to a company named “XYZ” such as its web addresses, branch names and locations, and its range of sales products at each branch. A schema graph representation of the XML data is shown in FIG. 23.

XML SOURCE

<company>
  <name>XYZ</name>
  <web>http://www.xyz.com</web>
  <description>Company founded in 1999 specialising in hi-tech consumers electronics</description>
  <branch>
    <name>North Ryde</name>
    <phone>0291230000</phone>
    <address>
      <number>1</number>
      <street>Lane Cove</street>
      <city>Sydney</city>
      <country>Australia</country>
    </address>
    <manager>
      <firstName>Jim</firstName>
      <lastName>Smith</lastName>
      <email>jsmith@xyz.com</email>
    </manager>
    <product>
      <id>1</id>
      <name>Plasma TV</name>
      <price>$10000</price>
      <supplier>JEC</supplier>
      <stock>10</stock>
    </product>
    <product>
      <id>2</id>
      <name>Mp3 player</name>
      <price>$500</price>
      <supplier>HG</supplier>
      <stock>20</stock>
    </product>
  </branch>
  <branch>
    <name>Morley</name>
    <phone>0891230000</phone>
    <address>
      <number>1</number>
      <street>Russel</street>
      <city>Perth</city>
      <country>Australia</country>
    </address>
    <manager>
      <firstName>Ted</firstName>
      <lastName>White</lastName>
      <email>twhite@xyz.com</email>
    </manager>
    <product>
      <id>3</id>
      <name>Video phone</name>
      <price>$2000</price>
      <supplier>NVC</supplier>
      <stock>15</stock>
    </product>
    <product>
      <id>4</id>
      <name>PDA</name>
      <price>$1000</price>
      <supplier>LP</supplier>
      <stock>50</stock>
    </product>
  </branch>
</company>

In FIG. 23, the integer shown next to each node is a unique ID number assigned to the node. Suppose that there are three existing views of this data source. The first is a view displaying the company's name, description and web address. The second is a listing of the company's branches and their locations, and finally the third view lists the line of products at each branch. Schema graph representations of these views are shown in FIG. 24, FIG. 25 and FIG. 26 respectively. As a result of these views, the occurrence 27000, co-occurrence 28000, leaf co-occurrence 29000, and sole child co-occurrence 30000 frequency tables are as shown in FIG. 27, FIG. 28, FIG. 29 and FIG. 30 respectively. The joint-occurrence frequency table, being three-dimensional, is depicted by five separate two-dimensional tables 31000, 32000, 33000, 34000 and 35000. FIG. 31 comprises entries freq(C_(k), P_(j), X) in the table with P_(j)=node 1. Similarly FIG. 32, FIG. 33, FIG. 34 and FIG. 35 each comprise entries with P_(j)=node 3, node 8, node 9, and node 10 respectively. In all frequency tables shown, an empty cell such as Item 28005, as seen in FIG. 28, denotes an invalid node combination whose associated frequency is not required to be stored.

Suppose that a user wishes to locate a particular product in the city where the user resides. The user enters the product's name, “Mp3 player”, and the name of the city, “Sydney” and performs a keyword search for both names. As seen from FIG. 23 this results in two hit nodes X₁=node 19 and X₂=node 13. To determine possible context trees for the keyword search operation, the system 4000 invokes method 18000 of FIG. 18. Since there is more than one hit node, the method 18000 subsequently invokes the method 20000 at step 18015. The method 20000 in turn invokes the method 12000 at step 20001 to obtain a list of nodes Y_(i) to serve as root nodes of the resulting context trees.
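For this example, the identification at step 12001 of the root node of the smallest sub-tree containing all hit nodes may be sketched in Python as follows. The `parent` mapping lists only those parent-child relations among node IDs that are stated in this description, the schema graph is assumed to be a tree with a single parent per node, and the function names are hypothetical.

```python
from typing import Dict, List


def path_to_root(node: int, parent: Dict[int, int]) -> List[int]:
    """Return the node followed by its successive ancestors."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def smallest_subtree_root(hits: List[int], parent: Dict[int, int]) -> int:
    """Root A of the smallest sub-tree that contains every hit node."""
    common = set(path_to_root(hits[0], parent))
    for h in hits[1:]:
        common &= set(path_to_root(h, parent))
    # Walking upwards from a hit node, the first common ancestor met is the deepest one.
    for node in path_to_root(hits[0], parent):
        if node in common:
            return node
    raise ValueError("hit nodes share no common ancestor")


# node 3 is a child of node 1; nodes 8 and 10 are children of node 3;
# node 13 is a child of node 8 and node 19 is a child of node 10.
parent = {3: 1, 8: 3, 10: 3, 13: 8, 19: 10}
print(smallest_subtree_root([19, 13], parent))  # prints 3
```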

The method 12000 first identifies at step 12001 node 3 as the root node of the smallest sub-tree containing both hit nodes X₁ and X₂. Thus A=node 3. The method 12000 then begins a recursive procedure to compute an occurrence probability value for each of A and its ancestors, given the hit nodes. At node A

Pr′[A|X₁^X₂]=1

At Y₁=node 1, the parent of node A, using Eq. 47, Eq. 48 and Eq. 49

$\Pr_{mean}\left\lbrack \text{node 1} \mid A\bigwedge X_{1}\bigwedge X_{2} \right\rbrack = \dfrac{\mathrm{freq}\left( \text{node 1},\text{node 19} \right)\,\mathrm{freq}\left( \text{node 1},\text{node 13} \right)\,\mathrm{freq}\left( \text{node 3} \right)}{\mathrm{freq}\left( \text{node 3},\text{node 19} \right)\,\mathrm{freq}\left( A,\text{node 13} \right)\,\mathrm{freq}\left( \text{node 1} \right)} = 0$

Thus

$\Pr^{\prime}\left\lbrack \text{node 1} \mid X_{1}\bigwedge X_{2} \right\rbrack = 0$

and

$\Pr^{\prime}\left\lbrack A\ \text{root} \mid X_{1}\bigwedge X_{2} \right\rbrack = 1$

Consequently the method 12000 exits with node A as a single candidate root node for a context tree. This context tree is assigned a score of 1. After the completion of the method 12000, the method 20000 continues with the second, top-down traversal phase where descendants of the root node Y_(i)=A are processed to identify context nodes among them. This phase begins at step 20010 where P_(j) is first set to be node 3. Since this node is an ancestor of the hit nodes X₁ and X₂, which are located under two distinct child nodes, execution proceeds eventually to step 39050 of method 39000, where the method 17000 is invoked to determine context nodes among its children. The values Q₁, . . . , Q₅ assigned to the child nodes 6-10 respectively of node 3 due to method 17000 are as follows:

$\begin{aligned} Q_{1} &= \Pr_{mean}\left\lbrack \text{node 6} \mid X_{1}\bigwedge P_{j} \right\rbrack + \Pr_{mean}\left\lbrack \text{node 6} \mid X_{2}\bigwedge P_{j} \right\rbrack \\ &= \dfrac{\mathrm{freq}\left( \text{node 6},\text{node 3},\text{node 19} \right)}{\mathrm{freq}\left( \text{node 3},\text{node 19} \right)} + \dfrac{\mathrm{freq}\left( \text{node 6},\text{node 3},\text{node 13} \right)}{\mathrm{freq}\left( \text{node 3},\text{node 13} \right)} \\ &= \dfrac{1}{1} + \dfrac{1}{1} \\ &= 2 \end{aligned}$

$\begin{aligned} Q_{2} &= \Pr_{mean}\left\lbrack \text{node 7} \mid X_{1}\bigwedge P_{j} \right\rbrack + \Pr_{mean}\left\lbrack \text{node 7} \mid X_{2}\bigwedge P_{j} \right\rbrack \\ &= \dfrac{\mathrm{freq}\left( \text{node 7},\text{node 3},\text{node 19} \right)}{\mathrm{freq}\left( \text{node 3},\text{node 19} \right)} + \dfrac{\mathrm{freq}\left( \text{node 7},\text{node 3},\text{node 13} \right)}{\mathrm{freq}\left( \text{node 3},\text{node 13} \right)} \\ &= 2 \end{aligned}$

$\begin{aligned} Q_{3} &= \Pr_{mean}\left\lbrack \text{node 8} \mid X_{1}\bigwedge P_{j} \right\rbrack \\ &= \dfrac{\mathrm{freq}\left( \text{node 8},\text{node 3},\text{node 19} \right)}{\mathrm{freq}\left( \text{node 3},\text{node 19} \right)} \\ &= 0 \end{aligned}$

$\begin{aligned} Q_{4} &= \Pr_{mean}\left\lbrack \text{node 9} \mid X_{1}\bigwedge P_{j} \right\rbrack + \Pr_{mean}\left\lbrack \text{node 9} \mid X_{2}\bigwedge P_{j} \right\rbrack \\ &= \dfrac{\mathrm{freq}\left( \text{node 9},\text{node 3},\text{node 19} \right)}{\mathrm{freq}\left( \text{node 3},\text{node 19} \right)} + \dfrac{\mathrm{freq}\left( \text{node 9},\text{node 3},\text{node 13} \right)}{\mathrm{freq}\left( \text{node 3},\text{node 13} \right)} \\ &= 1 \end{aligned}$

$\begin{aligned} Q_{5} &= \Pr_{mean}\left\lbrack \text{node 10} \mid X_{2}\bigwedge P_{j} \right\rbrack \\ &= \dfrac{\mathrm{freq}\left( \text{node 10},\text{node 3},\text{node 13} \right)}{\mathrm{freq}\left( \text{node 3},\text{node 13} \right)} \\ &= 0 \end{aligned}$

Thus the set Q₁, . . . , Q₅ sorted in descending order is {Q₁, Q₂, Q₄, Q₃, Q₅} and sums to T=5. The set of context nodes selected by the method 17000 thus comprises node 6, node 7 (since Q₁+Q₂>T/2), and node 8, node 10 (since they are ancestors of hit nodes). Resuming at step 39055, the method 39000 then recursively invokes itself to identify context child nodes for each of the selected nodes that have children.

For P_(j)=node 8, execution proceeds to step 39035 since node 8 has a single descendant hit node (node 13), at which point method 14000 is invoked to identify context nodes among the set of child nodes 11-14. The probability values Q₁, Q₂, and Q₄ assigned to the child nodes 11, 12, and 14 respectively of P_(j) due to the method 14000 are as follows:

$\begin{matrix}{Q_{1} = {\Pr_{mean}\left\lbrack {{node}\mspace{14mu} 11\text{❘}{X_{2}\bigwedge P_{j}}} \right\rbrack}} \\{= \frac{{freq}\left( {{{node}\mspace{14mu} 11},{{node}\mspace{14mu} 8},{{node}\mspace{14mu} 13}} \right)}{{freq}\left( {{{node}\mspace{14mu} 8},{{node}\mspace{14mu} 13}} \right)}} \\{= 1}\end{matrix}$ $\begin{matrix}{Q_{2} = {\Pr_{mean}\left\lbrack {{node}\mspace{14mu}{12}\text{❘}{X_{2}\bigwedge P_{j}}} \right\rbrack}} \\{= \frac{{freq}\left( {{{node}\mspace{14mu} 12},{{node}\mspace{14mu} 8},{{node}\mspace{14mu} 13}} \right)}{{freq}\left( {{{node}\mspace{14mu} 8},{{node}\mspace{14mu} 13}} \right)}} \\{= 1}\end{matrix}$ $\begin{matrix}{Q_{4} = {\Pr_{mean}\left\lbrack {{node}\mspace{14mu}{14}\text{❘}{X_{2}\bigwedge P_{j}}} \right\rbrack}} \\{= \frac{{freq}\left( {{{node}\mspace{14mu} 14},{{node}\mspace{14mu} 8},{{node}\mspace{14mu} 13}} \right)}{{freq}\left( {{{node}\mspace{14mu} 8},{{node}\mspace{14mu} 13}} \right)}} \\{= 1}\end{matrix}$

In addition, the method 14000 also computes a value Q₀ for a fictitious child node C₀:

$\begin{matrix}{Q_{0} = {\Pr_{mean}\left\lbrack {{node}\mspace{14mu} 13\mspace{14mu}{no}{\mspace{11mu}\;}{sibling}\text{❘}{X_{2}\bigwedge P_{j}}} \right\rbrack}} \\{= \frac{{freq}\left( {{{node}{\mspace{11mu}\;}8{\;\mspace{11mu}}{has}\mspace{14mu} 1\mspace{14mu}{child}},{{node}\mspace{14mu} 13}} \right)}{{freq}\left( {{{node}\mspace{14mu} 8},{{node}\mspace{14mu} 13}} \right)}} \\{= 0}\end{matrix}$
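The frequency ratios used for Q₁, Q₂, Q₄ and Q₀ can be illustrated with a minimal sketch. It assumes, purely for illustration, that each previous view is logged as a set of schema node numbers. The two logged views, the log format and the helper names `freq`, `pr_child` and `pr_no_sibling` are hypothetical and are not taken from the specification, and any averaging implied by the subscript in Pr_mean is not modelled.

```python
from typing import FrozenSet, List

View = FrozenSet[int]  # assumed log format: node numbers visible in one previous view


def freq(views: List[View], *nodes: int) -> int:
    """Number of logged views that contain all of the given nodes."""
    wanted = set(nodes)
    return sum(1 for view in views if wanted <= view)


def pr_child(views: List[View], child: int, parent: int, hit: int) -> float:
    """freq(child, parent, hit) / freq(parent, hit), or 0.0 if never observed."""
    denom = freq(views, parent, hit)
    return freq(views, child, parent, hit) / denom if denom else 0.0


def pr_no_sibling(views: List[View], parent: int, hit: int, children: List[int]) -> float:
    """Share of views holding parent and hit in which no other child of parent appears."""
    denom = freq(views, parent, hit)
    siblings = set(children) - {hit}
    alone = sum(1 for v in views if {parent, hit} <= v and not (siblings & v))
    return alone / denom if denom else 0.0


# Hypothetical log of two previous views, chosen to reproduce the values above.
previous_views = [
    frozenset({3, 8, 11, 12, 13, 14}),
    frozenset({3, 6, 8, 11, 12, 13, 14}),
]
children_of_8 = [11, 12, 13, 14]

q = {c: pr_child(previous_views, c, 8, 13) for c in (11, 12, 14)}
q0 = pr_no_sibling(previous_views, 8, 13, children_of_8)
print(q, q0)
```

With this assumed log, every previous view that shows node 8 together with node 13 also shows nodes 11, 12 and 14, so each ratio is 1 and the "no sibling" ratio is 0.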

Thus the set of probability values sorted in descending order is {Q₁, Q₂, Q₄, Q₀} and sums to T=3. The set of context nodes selected by the method 14000 thus comprises node 11, node 12, node 14 (since Q₁+Q₂+Q₄>T/2 and Q₁=Q₂=Q₄), and node 13 (since it is an ancestor of a hit node).
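The selection rule applied here can be sketched as follows. This is an illustrative reading, not the specification's implementation: candidates are taken in descending order of probability until the running sum reaches half of the total T, nodes tied with the last one taken are also kept (an assumption inferred from the note that Q₁=Q₂=Q₄), and the fictitious node is dropped afterwards. The function name `select_context_children` and the label `"C0"` are hypothetical.

```python
from typing import Dict, Hashable, List

FICTITIOUS = "C0"  # hypothetical label for the fictitious child node


def select_context_children(q: Dict[Hashable, float]) -> List[Hashable]:
    """Pick child nodes in descending order of probability until half of the total is covered."""
    total = sum(q.values())
    if total <= 0:
        return []  # assumption: nothing is selected when no probability mass exists
    ranked = sorted(q.items(), key=lambda item: item[1], reverse=True)
    selected: List[Hashable] = []
    running = 0.0
    last_value = None
    for node, value in ranked:
        # Stop once half of the total is covered, unless this node ties with the last pick.
        if running >= total / 2 and value != last_value:
            break
        selected.append(node)
        running += value
        last_value = value
    # The fictitious node only competes for selection; it is never returned.
    return [node for node in selected if node != FICTITIOUS]


# Values from the worked example for P_j = node 8.
q_values = {11: 1.0, 12: 1.0, 14: 1.0, FICTITIOUS: 0.0}
context_nodes = select_context_children(q_values) + [13]  # node 13 lies on the hit path
print(sorted(context_nodes))
```

When the fictitious node carries most of the probability mass it is taken first and then discarded, so no sibling is selected; this mirrors the blocking role the claims attribute to the fictitious node.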

A similar execution path is followed for the case P_(j)=node 10, with similar results being obtained. The set of context child nodes of node 10 comprises nodes 18-22. The schema graph 3600 of the context tree is thus as shown in FIG. 36, comprising the hit nodes 19 and 13, and context nodes 3, 6-8, 10-14, 18-22. The actual context tree returned to the user, comprising data items represented by these nodes, is as follows:

<branch>
  <name>North Ryde</name>
  <phone>0291230000</phone>
  <address>
    <number>1</number>
    <street>Lane Cove</street>
    <city>Sydney</city>
    <country>Australia</country>
  </address>
  <product>
    <id>1</id>
    <name>Plasma TV</name>
    <price>$10000</price>
    <supplier>JEC</supplier>
    <stock>10</stock>
  </product>
  <product>
    <id>2</id>
    <name>Mp3 player</name>
    <price>$500</price>
    <supplier>HG</supplier>
    <stock>20</stock>
  </product>
</branch>
<branch>
  <name>Morley</name>
  <phone>0891230000</phone>
  <address>
    <number>1</number>
    <street>Russel</street>
    <city>Perth</city>
    <country>Australia</country>
  </address>
  <product>
    <id>3</id>
    <name>Video phone</name>
    <price>$2000</price>
    <supplier>NVC</supplier>
    <stock>15</stock>
  </product>
  <product>
    <id>4</id>
    <name>PDA</name>
    <price>$1000</price>
    <supplier>LP</supplier>
    <stock>50</stock>
  </product>
</branch>
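One way to picture how such a context tree is cut from the source data is a recursive pruning pass that keeps only the elements whose schema node was selected. The sketch below assumes each selected schema node can be identified by its path of element names from the root; that path naming, the function name `prune` and the small source fragment (including its `<manager>` element) are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Set


def prune(elem: ET.Element, keep: Set[str], path: str = "") -> Optional[ET.Element]:
    """Return a copy of elem restricted to the schema paths in keep, or None if elem is dropped."""
    here = f"{path}/{elem.tag}"
    if here not in keep:
        return None
    copy = ET.Element(elem.tag, dict(elem.attrib))
    copy.text = elem.text
    for child in elem:
        kept_child = prune(child, keep, here)
        if kept_child is not None:
            copy.append(kept_child)
    return copy


# Hypothetical source fragment; <manager> stands in for a node that was not selected.
source = ET.fromstring(
    "<branch><name>North Ryde</name><manager>Hypothetical</manager>"
    "<address><city>Sydney</city></address></branch>"
)
selected_paths = {"/branch", "/branch/name", "/branch/address", "/branch/address/city"}

context_tree = prune(source, selected_paths)
print(ET.tostring(context_tree, encoding="unicode"))
```

Because the pass works on schema paths rather than on individual data items, every data item sharing a selected schema node is kept, which is consistent with both `<product>` elements of a branch appearing in the result above.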

INDUSTRIAL APPLICABILITY

It is apparent from the above that the arrangements described are applicable to the computer and data processing industries, and particularly in respect of presenting information from multiple searches.

The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

(Australia Only) In the context of this specification, the word “comprising” means “including principally but not necessarily solely” or “having” or “including”, and not “consisting only of”. Variations of the word “comprising”, such as “comprise” and “comprises” have correspondingly varied meanings.

1. A method of presenting data from a hierarchical data source, said method comprising the steps of: (i) constructing a first view of the hierarchical data source; (ii) obtaining an occurrence probability of at least one context data from at least the first view of the hierarchical data source; (iii) identifying a compulsory entity in the first view; (iv) selecting a context entity from the first view and the context data based on the occurrence probability; and (v) presenting a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source, comprising a plurality of context data, wherein each of the plurality of context data corresponds to the identified compulsory entity and the selected context entity, wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity, and the context entity is selected from the group consisting of: (a) the ancestor node; (b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity; (c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node; (d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and (e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
2. A method according to claim 1 wherein the hierarchical data source comprises a schema representation of said at least one data source and at least one previous view of the hierarchical data source.
3. A method according to claim 1 wherein said context data comprises data ranked according to relevance of said context entities to said compulsory entity.
4. A method according to claim 3 wherein said context data comprises at least one associated data.
5. A method according to claim 4 wherein said associated data comprises occurrence probability and a plurality of joint-occurrence frequencies of entities in said hierarchical data source observed in a previous view of the hierarchical data source.
6. A method according to claim 1 wherein said second set of nodes comprises one or more child nodes of at least one parent node in said first view of the hierarchical data source lying along said directed path from said ancestor node to said compulsory entity.
7. A method according to claim 1 wherein said corresponding distance comprises a number of links separating the nodes in said first view of the hierarchical data source.
8. A method according to claim 6 wherein, step (iv) comprises selecting said child nodes as context entities from all child nodes of said parent node, said selecting comprising the steps of: (iv-a) computing a first occurrence probability of said parent node appearing with none of its child nodes other than a fifth set of nodes, given an occurrence probability of said parent node, the ancestor node and said compulsory entity, said fifth set comprising at least one child node of said parent node lying along a directed path from said parent node to said compulsory entity; (iv-b) computing a second occurrence probability of each of said child nodes in a sixth set of nodes, given the occurrence probability of said parent node, the ancestor and said compulsory entity, said sixth set comprising at least one child node of said parent node that do not lie along a directed path from said parent node to said compulsory entity; (iv-c) computing a total sum of said first occurrence probability and said second occurrence probability; (iv-d) creating a fictitious node and assigning said fictitious node said first occurrence probability; (iv-e) selecting the fifth set of nodes or a seventh set of nodes as a set of context entities wherein the seventh set of child nodes is formed from said sixth set of child nodes and said fictitious node arranged in an order of descending values of said first occurrence probability or said second occurrence probability, and wherein a sum of said first occurrence probability or said second occurrence probabilities of said seventh set of nodes equals or exceeds half of said total sum; and (iv-f) deselecting as a context entity said fictitious node if said fictitious node is selected in said seventh set of child nodes, wherein said first occurrence probability and said second occurrence probability are approximated using an occurrence probability of a node in said hierarchical data source, a co-occurrence probability between a pair of nodes in said hierarchical data source, and joint-occurrence probability between an n-tuple of nodes in said hierarchical data source observed in said previous view.
9. A method according to claim 8 wherein said fictitious node prevents other nodes, whose associated probabilities are less than the probability associated with the fictitious node, from being selected, since a set of nodes are selected as a set of context entities when the total sum exceeds half of the total sum.
10. A method according to claim 6 wherein, step (iv) comprises selecting said child nodes as context entities from all child nodes of said parent node, said selecting comprising the steps of: (iv-a) computing a first occurrence probability of said parent node appearing with none of its child nodes other than a fifth set of nodes, given an occurrence probability of said parent node, the ancestor node and said compulsory entity, said fifth set comprising at least one child node of said parent node lying along a directed path from said parent node to said compulsory entity; (iv-b) selecting said fifth set of child nodes as a set of context entities; and if said first occurrence probability is less than or equal to 0.5: (iv-c) computing, a second occurrence probability of each of said child nodes in a sixth set of nodes, given the occurrence probability of said parent node, the ancestor node and said compulsory entity, said sixth set comprising at least one child node of said parent node that do not lie along a directed path from said parent node to said compulsory entity; (iv-d) computing a total sum of said second occurrence probabilities of said second set of child nodes; (iv-e) selecting as the set of context entities a seventh set of nodes formed from said sixth set of nodes in an order of descending values of said second occurrence probability until the sum of said second occurrence probability of said seventh set of child nodes equals or exceeds half of said total sum, wherein said first occurrence probability and said second occurrence probability are approximated using an occurrence probability of a node in said hierarchical data structure, co-occurrence probability between a pair of nodes in said hierarchical data structure, and joint-occurrence probability between an n-tuple of nodes in said hierarchical data structure observed in said previous view.
11. A method according to claim 1 wherein said second set of nodes comprises one or more child nodes of at least one parent node in said first view of the hierarchical data source not lying along said directed path from said ancestor node to said compulsory entity.
12. A method according to claim 11 wherein, step (iv) comprises selecting said child nodes as a set of context entities from all child nodes of said parent node, said selecting comprising the steps of: (iv-a) computing a first occurrence probability of said parent node appearing without any of its child nodes given the occurrence probability of said parent node, the ancestor node and said compulsory entity; (iv-b) computing a second occurrence probability of each of said child nodes of said parent node given the occurrence probability of said parent node the ancestor node and said compulsory entity; (iv-c) computing a total sum of said first occurrence probability and said second occurrence probability of all child nodes of said parent node; (iv-d) creating a fictitious node and assigning said fictitious node said first occurrence probability; (iv-e) selecting the set of context entities from a set of said fictitious node and all child nodes of said parent node arranged in order of descending values of said first occurrence probability or said second occurrence probabilities until the sum of said first occurrence probability or said second occurrence probability of selected nodes equals or exceeds half of said total sum; and (iv-f) deselecting said fictitious node as a context entity if said fictitious node is among said selected nodes, wherein said first occurrence probability and said second occurrence probability are approximated using an occurrence probability of a node in said hierarchical data source, a co-occurrence probability between a pair of nodes in said hierarchical data source representation, and a joint-occurrence probability between an n-tuple of nodes in said hierarchical data source observed in said previous view.
13. A method according to claim 11 wherein, step (iv) comprises selecting said child nodes as a set of context entities from all child nodes of said parent node, said selecting comprising the steps of (iv-a) computing a first occurrence probability of said parent node appearing without any of its child nodes given the occurrence probability of said parent node, the ancestor node and said compulsory entity; and if said first occurrence probability is less than or equal to 0.5: (iv-b) computing a second occurrence probability of each of the child nodes of said parent node given the occurrence probability of said parent node the ancestor node and said compulsory entity; (iv-c) computing a total sum of said second occurrence probabilities of all child nodes of said parent node, and (iv-d) selecting the set of context entities from the set of all child nodes of said parent node in order of descending values of said second occurrence probability until the sum of said second occurrence probability of selected nodes equals or exceeds half of said total sum, wherein said first occurrence probability and said second occurrence probability are approximated using an occurrence probability of a node in said hierarchical data source, a co-occurrence probability between a pair of nodes in said hierarchical data source, and a joint-occurrence probability between an n-tuple of nodes in said hierarchical data source observed in at least one said previous view.
14. A method according to claim 1 wherein said compulsory entity represents one of: (i) a location of one or more search keywords; and (ii) a user-selected entity.
15. A method according to claim 1 wherein said first view of the hierarchical data source comprises a tree representation and step (i) or (iii) includes detecting a user's selection of a sub-tree of said first view, and wherein, step (iv) comprises selecting a child node of a parent node in said user-selected sub-tree, said selecting comprising the steps of: (iv-a) computing a first occurrence probability of said parent node appearing without any of its child nodes given the occurrence probability of said parent node, and the ancestor node of said user-selected sub-tree; (iv-b) computing a second occurrence probability of each of said child nodes of said parent node given the occurrence probability of said parent node, and the ancestor node of said user-selected sub-tree; (iv-c) computing a total sum of said first occurrence probability and said second occurrence probability of all child nodes of said parent node; (iv-d) creating a fictitious node and assigning said fictitious node said first occurrence probability; (iv-e) selecting the context entity from the set of said fictitious node and all child nodes of said parent node in order of descending values of said first occurrence probability or said second occurrence probability until the sum of said first occurrence probability or said second occurrence probability of selected nodes equals or exceeds half of said total sum; and (iv-f) deselecting said fictitious node if said fictitious node is among said selected nodes.
16. A method according to claim 1 wherein said first view of the hierarchical data source comprises a tree representation and step (i) or (iii) includes detecting a user's selection of a sub-tree of said first view, and wherein, (iv) comprises selecting a child node of a parent node in said user-selected sub-tree, said selecting comprising the steps of: (iv-a) computing a first occurrence probability of said parent node appearing without any of its child nodes given the occurrence probability of said parent node, and the ancestor node of said user-selected sub-tree; if said first occurrence probability is less than or equal to 0.5 (iv-b) computing a second occurrence probability of each of said child node of said parent node given the occurrence probability of said parent node, and the ancestor of said user-selected sub-tree; (iv-c) computing a total sum of said second occurrence probability of all child nodes of said parent node; and (iv-d) selecting the context entity from the set of all child nodes of said parent node in order of descending values of said second occurrence probability until the sum of said second occurrence probability of selected nodes equals or exceeds half of said total sum.
17. A method of construction and presentation of data for a keyword searching operation in a hierarchical data source involving search keyword, said method comprising the steps of: (i) constructing a graphical first view of the hierarchical data source; (ii) identifying a compulsory entity in said graphical first view, wherein said compulsory entity is a node in said graphical first view representing a location of said search keyword; (iii) obtaining an occurrence probability of at least one context data from at least the first view of the hierarchical data source; (iv) constructing a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source comprising said compulsory entity and one or more context entities corresponding to the search keyword, wherein said context entities are obtained from said graphical first view using the context data and the occurrence probability; and (v) presenting said hierarchical data structure as a result of said keyword searching operation; wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity; and the context entity is selected from the group consisting of: (a) the ancestor node; (b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity; (c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node; (d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and (e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
18. A computer readable storage medium, having a computer-executable program recorded thereon, wherein the program is configured to make a computer execute a procedure to present data from a hierarchical data source, said program comprising: (i) code for constructing a first view of the hierarchical data source; (ii) code for obtaining an occurrence probability of at least one context data from at least the first view of the hierarchical data source; (iii) code for identifying a compulsory entity in the first view; (iv) code for selecting one context entity from the first view and the context data based on the occurrence probability; and (v) code for presenting a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source, comprising a plurality of context data, wherein each of the plurality of context data corresponds to the identified compulsory entity and the selected context entity; wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity, and the context entity is selected from the group consisting of: (a) the ancestor node; (b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity; (c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node; (d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and (e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
19. A computer readable storage medium, having a computer-executable program recorded thereon, wherein the program is configured to make a computer execute a procedure to construct and present data for a keyword searching operation in a hierarchical data source involving a search keyword, said program comprising: (i) code for constructing a first view of the hierarchical data source; (ii) code for identifying a compulsory entity in said first view, wherein said compulsory entity is a node in said first view representing a location of said search keyword; (iii) obtaining an occurrence probability of at least one context data from at least the first view of the hierarchical data source; (iv) code for constructing a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source, comprising said compulsory entity and one or more context entities, wherein said context entities are obtained from said first view using the context data and the occurrence probability; and (v) code for presenting said hierarchical data structure as a result of said keyword searching operation, wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity, and the context entity is selected from the group consisting of: (a) the ancestor node; (b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity; (c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node; (d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and (e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
20. Computer apparatus for constructing at least one data structure from a hierarchical data source, said apparatus comprising a constructing module configured to construct a first view of said hierarchical data source; an obtaining module configured to obtain an occurrence probability of at least one data element from at least the first view of the hierarchical data source; an identifying module configured to identify compulsory entity in the first view; a selecting module configured to select a context entity from the first view and the context data based on the occurrence probability; and a presenting module configured to present a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source, comprising a plurality of context data, wherein each of the plurality of context data corresponds to the identified compulsory entity and the selected context entity; wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity; and the context entity is selected from the group consisting of: (a) the ancestor node; (b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity; (c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node; (d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and (e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
21. Computer apparatus for construction and presentation of data for a keyword searching operation in a hierarchical data source involving a search keyword, said apparatus comprising: a constructing module configured to construct a first view of the hierarchical data source; an identifying module configured to identify a compulsory entity in said first view, wherein said compulsory entity is a node in said first view representing a location of search keyword; an obtaining module configured to obtain an occurrence probability of at least one context data from at least the first view of the hierarchical data source; a determining module configured to select a context entity from said first view and the occurrence probability context data obtained; a constructing module configured to construct a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source comprising said compulsory entity and said context entity; and a presenting module configured to present said hierarchical data structure comprising said compulsory entity and said context entity as a result of said keyword searching operation, wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity, and the context entity is selected from the group consisting of: (a) the ancestor node; (b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity; (c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node; (d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and (e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
22. A method according to claim 2 wherein said schema representation is updated as at least one new query is logged.